Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
Open WebUI Logic Tree Multi-Model LLM Project - advanced AI reasoning built on Sakana AI's AB-MCTS with multi-model collaboration
This course teaches how to fine-tune LLMs with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. It covers reinforcement fine-tuning (RFT) concepts, reward design, LLM-as-a-judge evaluation, and deploying training jobs on the Predibase platform.
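As a rough illustration of the reward-design step, the sketch below shows a hand-rolled reward function of the kind used in GRPO-style reinforcement fine-tuning: it scores a completion for following an expected output format and for matching a reference answer. The tag names and weights are assumptions for illustration, not the course's or Predibase's actual code.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 0.5 if the completion wraps reasoning and answer in the expected tags (illustrative format)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, expected: str) -> float:
    """Reward 1.0 if the text inside <answer> matches the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == expected.strip():
        return 1.0
    return 0.0

def total_reward(completion: str, expected: str) -> float:
    """Combined reward used to rank completions within a GRPO group."""
    return format_reward(completion) + correctness_reward(completion, expected)
```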
Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
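A minimal sketch of how such an LLM judge might be called, assuming an OpenAI-style chat API; the rubric, JSON response schema, and model name are illustrative assumptions, not this repository's actual interface.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_trajectory(task: str, trajectory: list[str], final_result: str) -> dict:
    """Ask an LLM judge to score an agent run on a 1-5 scale and return feedback."""
    steps = "\n".join(f"- {step}" for step in trajectory)
    prompt = (
        f"Task: {task}\n"
        f"Agent steps:\n{steps}\n"
        f"Final result: {final_result}\n\n"
        "Score task completion from 1 (failed) to 5 (fully completed). "
        'Respond only with JSON: {"score": <int>, "feedback": "<string>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```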
Quantifying uncertainty in LLM-as-judge evals with conformal prediction.
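For intuition, here is a minimal sketch of split conformal prediction applied to judge scores: calibrate on a held-out set of (judge score, human score) pairs, then wrap each new judge score in an interval with a finite-sample coverage guarantee. The absolute-residual nonconformity score and variable names are illustrative assumptions, not this project's implementation.

```python
import numpy as np

def conformal_interval(cal_judge, cal_human, new_judge, alpha=0.1):
    """Return lower/upper bounds that cover the unseen human score with probability >= 1 - alpha."""
    cal_judge = np.asarray(cal_judge, dtype=float)
    cal_human = np.asarray(cal_human, dtype=float)
    residuals = np.abs(cal_judge - cal_human)        # nonconformity scores on the calibration set
    n = len(residuals)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n     # finite-sample correction
    q = np.quantile(residuals, min(q_level, 1.0), method="higher")
    new_judge = np.asarray(new_judge, dtype=float)
    return new_judge - q, new_judge + q
```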
Evaluate translations with either a self-hosted embedder or ChatGPT as an LLM-as-judge.
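A minimal sketch of the embedder path, assuming a multilingual sentence-transformers model and cosine similarity between source and translation; the model name and example inputs are illustrative, not this repository's configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Any multilingual embedding model can serve as the self-hosted embedder;
# this model name is only an example.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def translation_score(source: str, translation: str) -> float:
    """Cosine similarity between source and translation embeddings (reference-free)."""
    embeddings = model.encode([source, translation], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(translation_score("Der Himmel ist blau.", "The sky is blue."))
```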