Source: a paper mentioned in the Xiaohongshu note 《trust judge如何提升LLM as Judge的质量》 ("How TrustJudge improves the quality of LLM-as-a-Judge").
Paper information
- Title: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
- arXiv:
- DOI:
- Code:
- Authors: Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
- Field: Artificial Intelligence / Computation and Language
Abstract (original text)
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A > B > C > A) and equivalence contradictions (A = B = C ≠ A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.
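Takeaways from the Xiaohongshu note
The note boils the paper down to two parts:
Motivation
LLM-as-a-Judge commonly shows two kinds of inconsistency:
- Score and comparison disagree
  - A response with a lower single-response score ends up winning the pairwise comparison
- Pairwise comparisons are not transitive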
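  - Circular preference chains appear: A > B > C > A
  - Equivalence contradictions appear: A = B = C ≠ A (as described in the abstract)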
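To make the two inconsistency types concrete, here is a minimal sketch of how they could be detected from a judge's outputs. The function names and the `scores` / `prefers` data layout are hypothetical choices for illustration, not taken from the paper or its code.

```python
from itertools import combinations, permutations

def score_comparison_conflicts(scores, prefers):
    """Pairs (a, b) where a scored strictly higher than b, yet b won the pairwise comparison.

    scores:  dict mapping response id -> single-response score (assumed layout)
    prefers: dict mapping (x, y) -> True if the judge preferred x over y (assumed layout)
    """
    conflicts = []
    for a, b in permutations(scores, 2):
        if scores[a] > scores[b] and prefers.get((b, a), False):
            conflicts.append((a, b))
    return conflicts

def preference_cycles(items, prefers):
    """Triples (x, y, z) forming a circular chain x > y > z > x."""
    cycles = []
    for a, b, c in combinations(items, 3):
        # A triple can form a cycle in two orientations; check both.
        for x, y, z in ((a, b, c), (a, c, b)):
            if prefers.get((x, y)) and prefers.get((y, z)) and prefers.get((z, x)):
                cycles.append((x, y, z))
    return cycles

# Toy data: C has the highest score but loses to B, and A > B > C > A.
scores = {"A": 4, "B": 3, "C": 5}
prefers = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
conflicts = score_comparison_conflicts(scores, prefers)
cycles = preference_cycles(["A", "B", "C"], prefers)
```

On this toy data, `conflicts` flags the pair where C outscores B but loses to it, and `cycles` contains the single chain A > B > C > A.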
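Method
- Finer-grained single-response scoring
  - Expand the scale from 5 points to 100-point granularity
  - Take the probability distribution over scores 1 to 100, then compute the expected score
- More robust pairwise comparison
  - Compare using perplexity or probabilities under both presentation orders
  - Or aggregate the results of multiple comparison rounds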
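The two ideas above can be sketched roughly as follows. This is a simplified illustration and not the paper's implementation: it assumes the judge's log-probabilities for each candidate score, and its win/tie/lose probabilities under both presentation orders, have already been extracted. The names `expected_score` and `bidirectional_preference` and the dict layouts are hypothetical.

```python
import math

def expected_score(logprobs):
    """Expected score under the judge's distribution over candidate scores.

    logprobs: dict mapping candidate score (e.g. 1..100) -> log-probability.
    Values need not be normalized; a softmax over the candidates is applied.
    """
    peak = max(logprobs.values())  # subtract the max for numerical stability
    weights = {s: math.exp(lp - peak) for s, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total

def bidirectional_preference(p_ab, p_ba):
    """Average the judge's verdict probabilities over both presentation orders.

    p_ab: probabilities with keys "first", "tie", "second", with A shown first
    p_ba: same keys, with B shown first
    Returns (label, combined): the outcome with the highest averaged
    probability ("A", "B", or "tie") and the averaged probabilities.
    """
    combined = {
        "A": (p_ab["first"] + p_ba["second"]) / 2,
        "B": (p_ab["second"] + p_ba["first"]) / 2,
        "tie": (p_ab["tie"] + p_ba["tie"]) / 2,
    }
    return max(combined, key=combined.get), combined

# A judge split evenly between 3 and 4 yields 3.5 rather than a forced integer.
score = expected_score({3: math.log(0.5), 4: math.log(0.5)})

# Position bias: whichever response is shown first wins each single-order verdict,
# but averaging the two orders still yields one consistent answer.
label, combined = bidirectional_preference(
    {"first": 0.6, "tie": 0.3, "second": 0.1},
    {"first": 0.5, "tie": 0.3, "second": 0.2},
)
```

The point of the expectation is that two responses which would both round to the same discrete score can still be separated, which is the information loss the abstract attributes to discrete rating systems.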
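Quick notes
- This is more of an evaluation-protocol improvement than a new model architecture
- Pro: no additional training of the judge is required
- Con: inference cost is higher, so large-scale online use would be fairly expensive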
Companion reading
- Plain-language explainer: