LLM Benchmarks: A Reality Check

Most benchmark comparisons are misleading. Here's how to read them without getting fooled.

By AI Signal Editorial

Every model launch ships with a benchmark table where the new model is in bold and a percentage point or two ahead of the competition. Almost none of these tables tell you what you actually need to know to pick a model for your workload.

Contamination is everywhere

Public benchmarks leak into pre-training corpora. Sometimes it is an honest accident: the test set was scraped along with everything else. Sometimes it is less honest. Either way, headline numbers on MMLU, GSM8K, HumanEval, and their successors have to be treated as upper bounds, not as estimates of generalisation, because a model that has seen the test items can score well by recall alone. The clean way to compare models is on a private benchmark you built yourself, scored consistently across vendors. Yes, that is expensive. It is also the only thing that works.
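What "scored consistently across vendors" might look like in practice is sketched below. This is a minimal illustration, not a reference harness: the `EvalCase` and `Model` types, the `complete` function, and the exact-match scoring rule are all assumptions made for the example, and a real private benchmark would likely need a task-specific scorer.

```typescript
// Hypothetical types: none of these names come from a real vendor SDK.
type EvalCase = { id: string; prompt: string; expected: string };
type Model = { name: string; complete: (prompt: string) => Promise<string> };

// One scoring rule, applied identically to every vendor's output.
// Here: exact match after trimming, lowercasing, and collapsing whitespace.
function isCorrect(output: string, expected: string): boolean {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return norm(output) === norm(expected);
}

// Runs every case through one model and returns the fraction answered correctly.
async function scoreModel(model: Model, cases: EvalCase[]): Promise<number> {
  let correct = 0;
  for (const c of cases) {
    const output = await model.complete(c.prompt);
    if (isCorrect(output, c.expected)) correct += 1;
  }
  return cases.length === 0 ? 0 : correct / cases.length;
}
```

The point of the design is that the prompts, the normalisation, and the pass criterion live in one place, so a difference in score reflects the model and not the harness.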

Pass-at-k is not pass-at-one

A model that solves a coding task on the fifth sample is not the same model your users will experience. Many published numbers are pass-at-k (usually written pass@k) for some k greater than one, often combined with chain-of-thought prompting and sampling at a nonzero temperature. For a production system that takes one shot per request, the relevant number is pass-at-1, and the gap between pass-at-1 and pass-at-5 can be twenty percentage points.
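To see how quickly the number inflates with k, here is a small sketch of the unbiased pass@k estimator introduced alongside HumanEval: draw n samples per task, count the c that pass, and estimate the chance that at least one of k randomly chosen samples passes as 1 - C(n-c, k) / C(n, k). The function name `passAtK` is our own; the product form is simply a numerically friendlier way to evaluate the same ratio.

```typescript
// Unbiased pass@k estimate for one task, given n samples of which c passed.
// Equals 1 - C(n - c, k) / C(n, k), computed as a running product so that
// no large binomial coefficients are ever formed.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("need 1 <= k <= n and 0 <= c <= n");
  }
  // Fewer than k failing samples: every size-k subset contains a pass.
  if (n - c < k) return 1;
  let allFail = 1; // probability that all k chosen samples fail
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}
```

For a single task solved in 4 of 20 samples, `passAtK(20, 4, 1)` is 0.20 while `passAtK(20, 4, 5)` is about 0.72. The benchmark score is the mean of this value over all tasks, so always check which k a published table reports.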

Latency is part of quality

A model that is two points ahead on a reasoning benchmark but three times slower at the 99th percentile is, for most products, the worse model. Published benchmark tables rarely include latency. Build your own table that does, and weight it by the actual response-time budget your product allows. The model rankings will often change.
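One way to build such a table is sketched below, under stated assumptions: each evaluation run records whether the answer was correct and how long it took, and an answer that misses the budget counts as wrong. The `Run` type, the nearest-rank percentile, and the hard cutoff are illustrative choices, not the only reasonable weighting.

```typescript
// Hypothetical record of one evaluation request.
type Run = { correct: boolean; latencyMs: number };

// Nearest-rank percentile; p is in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Accuracy where an answer only counts if it arrived within the budget.
function accuracyWithinBudget(runs: Run[], budgetMs: number): number {
  if (runs.length === 0) return 0;
  const ok = runs.filter((r) => r.correct && r.latencyMs <= budgetMs).length;
  return ok / runs.length;
}

// One row of the comparison table for a single model.
function summarize(name: string, runs: Run[], budgetMs: number) {
  return {
    name,
    rawAccuracy: accuracyWithinBudget(runs, Infinity),
    p99Ms: percentile(runs.map((r) => r.latencyMs), 99),
    accuracyWithinBudget: accuracyWithinBudget(runs, budgetMs),
  };
}
```

Sorting the rows by `accuracyWithinBudget` instead of `rawAccuracy` is what lets a slower model that leads on raw accuracy fall behind a faster one.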

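// Example: a minimal RAG retrieval step
// Assumes vectorStore, crossEncoder, query, and queryEmbedding are defined elsewhere.
// Fetch a wide candidate set, rerank it, then keep only the top 6 passages as context.
const hits = await vectorStore.search(queryEmbedding, { topK: 50 });
const reranked = await crossEncoder.rerank(query, hits);
const context = reranked.slice(0, 6).map((h) => h.text).join("\n---\n");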