Evaluating AI models

11 min read · Reviewed May 2026 · benchmarks

Leaderboards are not products. A model that ranks first on MMLU may underperform on your domain; a model that sits mid-pack on the HELM family (Stanford CRFM) may be exactly what your support chat needs. The work of evaluation is bridging two things that look similar but aren't: capability benchmarks (can the model do X in isolation?) and product evaluations (does the model deliver Y, on YOUR task, with YOUR users?). Both are needed; neither replaces the other.

Difficulty: intermediate

You'll learn: the benchmark families that matter in 2026, what each one measures, and why the gap between leaderboard rank and product fit persists.

The benchmark families that matter

Seven benchmark families mapped against what they actually measure — factual knowledge, reasoning, multi-modality, agentic tasks, and robustness.

Every benchmark family measures a different slice of model capability. MMLU tests factual knowledge across 57 subjects, and MMLU-Pro is its harder successor with ten answer options per question and more reasoning-heavy items; both are widely cited because they are easy to run and interpret, not because they are the most useful for product decisions. The HELM family (Stanford CRFM) covers factuality, robustness, fairness, and toxicity across a fixed scenario set; its sub-variants extend to specific modalities. BIG-Bench Hard isolates the subset of BIG-Bench reasoning tasks on which models have historically struggled. GPQA Diamond is the highest-quality subset of GPQA, a set of graduate-level science questions written by domain experts to be difficult to answer by web search. Inspect AI, an evaluation framework rather than a single dataset, covers coding, agentic tasks, reasoning, knowledge, and multi-modal understanding, with runs that are reproducible and auditable. ARC-AGI measures abstract reasoning; SWE-bench measures real-world software engineering via repository issue resolution.

The matrix below maps each benchmark family against five capability dimensions. "Full" indicates the benchmark primarily and directly tests that dimension; "partial" indicates it touches the dimension but is not designed around it; "none" indicates the benchmark does not cover it.

What each benchmark family actually measures, May 2026

HELM as a family, not a benchmark

Stanford CRFM's HELM has grown from one benchmark into a family spanning vision, audio, finance, and more — pick the variant that matches your modality.

The HELM family (Stanford CRFM) is one of the more consequential benchmark developments of the past three years. What started as a single evaluation covering factuality, robustness, fairness, bias, and toxicity across a fixed scenario set has expanded into a benchmark family: HELM core (the original), HEIM for text-to-image models, VHELM for vision-language models, HELM Audio for audio-language understanding, ToRR for table reasoning, HELM Finance for financial document tasks, and specialised variants for long-context tasks, Arabic language, and robotic reward modelling (RoboRewardBench).

The practitioner consequence is important: "HELM rank" has become ambiguous. A model can be strong on HELM core and weak on HELM Finance. When someone cites "HELM performance" without specifying the variant, the number is almost uninterpretable for product decisions. Pick the HELM variant that matches your modality and task domain. Do not quote a composite "HELM rank" as if it were a single number.

Inspect AI — evaluation as code

Co-developed by the UK AI Security Institute and Meridian Labs — open framework for reproducible frontier model evaluations with auditable results.

Inspect AI is co-developed by the UK AI Security Institute (AISI, renamed from the UK AI Safety Institute in early 2025, a research organisation within the UK Department for Science, Innovation and Technology) and Meridian Labs. It is an open framework for frontier AI evaluations covering coding, agentic tasks, reasoning, knowledge, behaviour, and multi-modal understanding.

The design philosophy is evaluation-as-code: tasks are Python definitions, runs are reproducible, and results are stored in a structured, inspectable format. This makes Inspect AI particularly suited to regulated contexts (results are auditable and version-controlled) and to research teams who need to share and reproduce evaluation results across organisations.

The task definition structure is straightforward: a task specifies a dataset, a solver (typically a chain-of-thought plus generate step), and a scorer. The example below shows the pattern — a knowledge Q&A task using model-graded factual scoring.

# Inspect AI task definition — evaluation-as-code pattern
# co-developed by UK AI Security Institute and Meridian Labs
from inspect_ai import task, Task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate

@task
def knowledge_qa() -> Task:
    return Task(
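        # "knowledge_qa" is a placeholder name; Inspect AI's bundled examples
        # include "theory_of_mind". Swap in a dataset you actually have.
        dataset=example_dataset("knowledge_qa"),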
        solver=[chain_of_thought(), generate()],
        scorer=model_graded_fact(),
    )

# Run: inspect eval knowledge_qa.py --model openai/gpt-4o

Inspect AI task definition — evaluation-as-code pattern (UK AI Security Institute + Meridian Labs)

lm-eval-harness and reproducibility

EleutherAI's lm-evaluation-harness is the de facto standard for reproducing published benchmark numbers — version and seed pinning are non-negotiable.

EleutherAI's lm-evaluation-harness is the de facto open-source standard for reproducing published benchmark results across most of the families covered above. Many benchmark numbers published in research papers from 2024 through 2026 can be reproduced with lm-eval-harness, and many model cards cite it explicitly.

The version-pinning discipline matters more than it sounds. Prompt formats, tokenisation handling, and scoring logic all differ across lm-eval-harness versions. A paper that publishes a benchmark number without specifying the harness version, random seed, and prompt format is publishing a number that is barely reproducible in practice — not fraudulent, just underspecified. When citing benchmark results, always pin version and seed. When comparing two models, ensure both numbers used the same harness version and identical prompt formats.
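One way to make the pinning discipline concrete is to store a small manifest next to every score and refuse to compare scores whose manifests differ. The sketch below is illustrative only: `RunManifest`, `mismatches`, the field names, and the version strings are hypothetical and are not part of lm-eval-harness.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunManifest:
    """Settings that should match before two benchmark scores are compared.

    All field names here are assumptions made for illustration.
    """
    harness_version: str  # release of the evaluation harness that was used
    task: str             # benchmark task name
    prompt_format: str    # label for the prompt template
    num_fewshot: int      # number of in-context examples
    seed: int             # random seed for sampling and few-shot selection


def mismatches(a: RunManifest, b: RunManifest) -> list[str]:
    """Return the names of the fields that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Illustrative values only; neither run is a real published result.
run_a = RunManifest("0.4.2", "mmlu", "default", 5, 1234)
run_b = RunManifest("0.4.5", "mmlu", "default", 5, 1234)

diff = mismatches(run_a, run_b)
if diff:
    print("Not directly comparable; differing fields:", diff)
```

An empty list means the two scores were produced under the same recorded settings; anything else names exactly what needs to be re-run before the comparison is meaningful.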

Capability vs product evaluation

Capability evals measure what a model can do in isolation; product evals measure whether your integration delivers — both are needed.

Capability evaluations answer: can this model do X in isolation, at a controlled temperature, on a clean benchmark dataset? They are useful for model selection, for tracking improvement across versions, and for communicating model fitness to non-technical stakeholders.

Product evaluations answer: does this model, inside my system, with my prompts, on my users' actual inputs, deliver the outcome I need? They require a domain-specific dataset, a scoring rubric that captures your quality criteria, and a feedback loop that refreshes the dataset as inputs evolve.
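A scoring rubric can be as simple as a list of weighted criteria. The following is a minimal sketch under stated assumptions: the criterion names, the weights, and the `Criterion` and `rubric_score` helpers are hypothetical, and per-criterion ratings are assumed to arrive in the range 0 to 1 from a human reviewer or a grader model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; weights need not sum to 1


def rubric_score(ratings: dict[str, float], rubric: list[Criterion]) -> float:
    """Weighted mean of per-criterion ratings, each expected in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    missing = [c.name for c in rubric if c.name not in ratings]
    if missing:
        raise KeyError(f"missing ratings for: {missing}")
    return sum(c.weight * ratings[c.name] for c in rubric) / total_weight


# Hypothetical rubric for a support-chat product.
SUPPORT_RUBRIC = [
    Criterion("resolves_issue", 3.0),
    Criterion("factually_correct", 2.0),
    Criterion("tone", 1.0),
]

ratings = {"resolves_issue": 1.0, "factually_correct": 0.5, "tone": 1.0}
score = rubric_score(ratings, SUPPORT_RUBRIC)
print(f"rubric score: {score:.3f}")
```

Keeping the rubric as data rather than burying it in a prompt makes it easy to version alongside the dataset as inputs evolve.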

The gap between these two evaluation types is the most persistent source of model-selection errors. A team that evaluates by leaderboard rank and discovers the mismatch in production has skipped the product evaluation step. A team that only runs product evals and ignores capability benchmarks may miss a stronger model released last month.

Capability evals tell you whether a model is worth integrating. Product evals tell you whether your integration is worth shipping. Both are needed; neither replaces the other.

Reading leaderboards honestly

Top-3 rank differences are often within margin of error — domain-specific validation is the signal that matters for product decisions.

Most public leaderboards show closely clustered scores at the top. The difference between rank 1 and rank 5 on MMLU is frequently smaller than the variation introduced by different prompt formats, sampling temperatures, and harness versions. Treating leaderboard rank as a procurement signal without domain validation is one of the more common evaluation mistakes in production AI.
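Sampling error alone can be estimated with a back-of-envelope calculation before prompt or harness variation is even considered. The sketch below uses the binomial standard error and treats the two models as independent samples, which is a simplifying assumption because both are scored on the same items; the accuracies and item count are made-up illustrative values.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def difference_z_score(acc_a: float, acc_b: float, n_items: int) -> float:
    """z-score for acc_a - acc_b, treating the two runs as independent."""
    se = math.sqrt(
        accuracy_standard_error(acc_a, n_items) ** 2
        + accuracy_standard_error(acc_b, n_items) ** 2
    )
    if se == 0.0:
        raise ValueError("standard error is zero; accuracies are 0 or 1")
    return (acc_a - acc_b) / se


# Hypothetical scores: 88% vs 87% on a 1,000-item benchmark.
z = difference_z_score(0.88, 0.87, 1000)
print(f"z = {z:.2f}")
if abs(z) < 1.96:
    print("Difference is within sampling noise at the 95% level.")
```

With these illustrative numbers a one-point gap does not clear the conventional 95% threshold, and that is before any variation from prompts, temperature, or harness version is added.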

Dataset contamination compounds this. 'State-of-the-art' benchmark claims sometimes involve models trained on the benchmark test set, or fine-tuned on data that overlaps with it. The contamination is not always intentional or discoverable, but its effect is well-documented: models that score well on the benchmark and underperform on unseen inputs.
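Where you do have access to a training or fine-tuning corpus, a crude n-gram overlap check can flag the most obvious leaks. This is a heuristic sketch, not a complete contamination audit: it uses whitespace tokenisation, the function names are hypothetical, the example strings are invented, and a low score does not prove a test item is clean.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens; nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy data; n=3 keeps the example short, longer n-grams are more typical.
question = "what is the capital of france"
leaky_corpus = ["quiz answers: what is the capital of france paris"]
clean_corpus = ["the weather in paris is mild"]

leaky = overlap_fraction(question, leaky_corpus, n=3)
clean = overlap_fraction(question, clean_corpus, n=3)
print(f"leaky corpus overlap: {leaky:.2f}, clean corpus overlap: {clean:.2f}")
```

Items with high overlap are candidates for removal from the evaluation set or for separate reporting; paraphrased or translated leaks will not be caught by exact n-gram matching.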

The signal that matters for product decisions is relative performance on your domain, not absolute leaderboard rank. If your use case is financial document summarisation, run a sample of your actual documents through the top-3 candidates and score against your criteria. That comparison will tell you more than any leaderboard position.
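The domain comparison described above can be sketched as a small loop over candidates and documents. Everything in this example is a stand-in: the candidate "models" are plain functions rather than API clients, `word_recall` is a toy scorer chosen only so the example runs, and the documents are invented. In practice the scorer would be your own rubric or a human review.

```python
from statistics import mean
from typing import Callable

Model = Callable[[str], str]          # document in, summary out
Scorer = Callable[[str, str], float]  # (document, summary) -> score in [0, 1]


def rank_candidates(
    documents: list[str],
    candidates: dict[str, Model],
    scorer: Scorer,
) -> list[tuple[str, float]]:
    """Mean score per candidate over the domain sample, best first."""
    results = {
        name: mean(scorer(doc, model(doc)) for doc in documents)
        for name, model in candidates.items()
    }
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


def word_recall(document: str, summary: str) -> float:
    """Toy scorer: share of distinct document words present in the summary."""
    doc_words = set(document.lower().split())
    if not doc_words:
        return 0.0
    return len(doc_words & set(summary.lower().split())) / len(doc_words)


# Invented sample documents and stand-in candidates.
documents = [
    "Q3 revenue rose 12% while costs fell 3%.",
    "Net debt increased after the acquisition closed.",
]
candidates: dict[str, Model] = {
    "model_a": lambda doc: doc,             # echoes the document
    "model_b": lambda doc: doc.split()[0],  # returns the first word only
    "model_c": lambda doc: "",              # returns nothing
}

ranking = rank_candidates(documents, candidates, word_recall)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Keeping the documents, candidates, and scorer as separate arguments means the same loop can be re-run whenever a new model is released or the domain sample is refreshed.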

Warning

Treat leaderboards as a starting filter, not a procurement document. The difference between rank 1 and rank 5 on MMLU is often smaller than the difference between rank 1 on MMLU and good-enough on your domain. Run a domain sample before committing to a model.