This article examines whether AI models can fairly judge other AI models, or whether bias, labels, and model type quietly skew the results.
Many teams use AI models to score or rank other AI models. The problem is that these AI judges may be biased. They might favor their own company, prefer slower thinking models, or change decisions when model names are shown. This makes benchmarks and comparisons unreliable.
The authors built a clear, repeatable system for testing AI judges on MT-Bench, a popular benchmark. They compared models from three vendors across two tiers, fast and thinking, and studied how judges behave when answers are anonymous versus labeled.
They ran three main experiments. First, they checked blind judging to see if bias exists even when model names are hidden. Second, they tested bias across different task types like writing, math, and coding. Third, they showed judges different identity hints to see how labels change outcomes.
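The blind and labeled conditions are easy to picture as prompt variants. The sketch below is only an illustration of the idea, not the authors' code: call_judge is a hypothetical stand-in for whatever API serves the judge model, and the hint strings are assumptions about how identity cues might be phrased.

```python
from typing import Optional

def call_judge(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to the judge model and return its verdict.
    Replace with a real API call in practice; returns a dummy value here."""
    return "tie"

def build_prompt(question: str, answer_a: str, answer_b: str,
                 label_a: Optional[str] = None, label_b: Optional[str] = None) -> str:
    """Build a pairwise judging prompt; labels are omitted in the blind condition."""
    name_a = label_a or "Assistant A"
    name_b = label_b or "Assistant B"
    return (
        "You are judging two answers to the same question.\n"
        f"Question: {question}\n\n"
        f"[{name_a}]\n{answer_a}\n\n"
        f"[{name_b}]\n{answer_b}\n\n"
        "Which answer is better? Reply with 'A', 'B', or 'tie'."
    )

question = "Explain recursion to a beginner."
answer_a = "..."  # answer produced by model A
answer_b = "..."  # answer produced by model B

# Blind condition: no identity hints are shown to the judge.
blind_verdict = call_judge(build_prompt(question, answer_a, answer_b))

# Labeled condition: identity hints are shown (example labels are assumptions).
labeled_verdict = call_judge(
    build_prompt(question, answer_a, answer_b,
                 label_a="GPT (OpenAI)", label_b="Claude (Anthropic)")
)

# Comparing verdict distributions across the two conditions shows how labels shift decisions.
```

Running the same question set through both conditions, and through judges from each vendor, is what lets the study separate genuine quality differences from label-driven bias.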
The results show that bias is real. Some judges strongly favor their own vendor. Thinking models usually score higher than fast ones. Showing model names changes decisions, but no hinting setup fully removes bias. In many cases, GPT answers win even when judged by non-GPT models, which points to a mix of genuine quality differences and amplified bias.