Abstract: Simply measuring a language model’s performance after deployment is not enough. Rigorous and inclusive benchmarking isn’t just a checkbox along the way—it’s the foundation upon which ...