Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
Open WebUI Logic Tree Multi-Model LLM Project - advanced AI reasoning built on Sakana AI's AB-MCTS with multi-model collaboration
This course teaches how to fine-tune LLMs with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. It covers reinforcement fine-tuning (RFT) concepts, reward design, LLM-as-a-judge evaluation, and deploying training jobs on the Predibase platform.
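As a rough illustration of the reward-design step, the sketch below shows a hand-rolled reward function of the kind used in GRPO-style reinforcement fine-tuning: it scores a completion for following an expected output format and for matching a reference answer. The tag names and weights are assumptions for illustration, not the course's or Predibase's actual code.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 0.5 if the completion wraps reasoning and answer in the expected tags (illustrative format)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, expected: str) -> float:
    """Reward 1.0 if the text inside <answer> matches the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == expected.strip():
        return 1.0
    return 0.0

def total_reward(completion: str, expected: str) -> float:
    """Combined reward used to rank completions within a GRPO group."""
    return format_reward(completion) + correctness_reward(completion, expected)
```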
Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
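A minimal sketch of how such an LLM judge might be called, assuming an OpenAI-style chat API; the rubric, JSON response schema, and model name are illustrative assumptions, not this repository's actual interface.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_trajectory(task: str, trajectory: list[str], final_result: str) -> dict:
    """Ask an LLM judge to score an agent run on a 1-5 scale and return feedback."""
    steps = "\n".join(f"- {step}" for step in trajectory)
    prompt = (
        f"Task: {task}\n"
        f"Agent steps:\n{steps}\n"
        f"Final result: {final_result}\n\n"
        "Score task completion from 1 (failed) to 5 (fully completed). "
        'Respond only with JSON: {"score": <int>, "feedback": "<string>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```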
Quantifying uncertainty in LLM-as-judge evals with conformal prediction.
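For intuition, here is a minimal sketch of split conformal prediction applied to judge scores: calibrate on a held-out set of (judge score, human score) pairs, then wrap each new judge score in an interval with a finite-sample coverage guarantee. The absolute-residual nonconformity score and variable names are illustrative assumptions, not this project's implementation.

```python
import numpy as np

def conformal_interval(cal_judge, cal_human, new_judge, alpha=0.1):
    """Return lower/upper bounds that cover the unseen human score with probability >= 1 - alpha."""
    cal_judge = np.asarray(cal_judge, dtype=float)
    cal_human = np.asarray(cal_human, dtype=float)
    residuals = np.abs(cal_judge - cal_human)        # nonconformity scores on the calibration set
    n = len(residuals)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n     # finite-sample correction
    q = np.quantile(residuals, min(q_level, 1.0), method="higher")
    new_judge = np.asarray(new_judge, dtype=float)
    return new_judge - q, new_judge + q
```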
Evaluate translations with either a self-hosted embedder or ChatGPT as an LLM-as-judge.
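A minimal sketch of the embedder path, assuming a multilingual sentence-transformers model and cosine similarity between source and translation; the model name and example inputs are illustrative, not this repository's configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Any multilingual embedding model can serve as the self-hosted embedder;
# this model name is only an example.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def translation_score(source: str, translation: str) -> float:
    """Cosine similarity between source and translation embeddings (reference-free)."""
    embeddings = model.encode([source, translation], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(translation_score("Der Himmel ist blau.", "The sky is blue."))
```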