Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Published in International Conference on Learning Representations (ICLR) 2025, 2024

This paper examines whether the preferences expressed by LLM judges correspond to tangible improvements in alignment metrics. It introduces SOS-Bench (Substance Outweighs Style Benchmark), the largest standardized LLM meta-benchmark to date, and uses it to show that LLM-judge preferences often prioritize style over factual accuracy and safety. The study emphasizes the critical role of the supervised fine-tuning stage of post-training in improving alignment, and highlights the importance of data scaling and prompt diversity.

Recommended citation: Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson. (2025). "Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking." International Conference on Learning Representations (ICLR) 2025.
