Eval platforms and tooling

12 min read · Reviewed May 2026 · eval tooling

If you had gone shopping for 'an eval platform' in 2024, you would have compared LangSmith, Braintrust, Patronus, and Promptfoo as roughly peer offerings in the same category. In 2026, that comparison no longer makes sense. LangSmith and Braintrust have moved to observability-first positioning. Patronus AI has pivoted up the stack to world-simulation and expert evaluation. Promptfoo has moved sideways to AI security testing. The pure 'eval platform' category has largely dissolved in two years, and most write-ups haven't caught up. What follows describes what each tool actually is as of May 2026.


Difficulty: intermediate


You'll learn: what "an eval platform" means in 2026 versus 2024, and which tools to pick and when. The eval-platform category as it existed in 2024 has largely been absorbed into AI observability.

Score state: partial, because the eval-platform vendor landscape is shifting fast. Last reviewed May 2026; next revisit August 2026.

The category shift

Observability absorbed evals: four tools that competed as eval peers in 2024 have each moved to a different category by 2026.

The eval-platform category of 2024 had four reasonably peer offerings: LangSmith (LangChain's evaluation and observability tooling), Braintrust (a purpose-built eval workflow platform), Patronus (an LLM evaluation platform), and Promptfoo (an open-source eval framework). They competed on features, pricing, and integrations within a recognisable category.

By May 2026, each has moved. LangSmith is now LangSmith Observability, an AI agent observability platform where evals are a feature within the observability workflow rather than the headline product. Braintrust positions itself as an 'AI observability platform for building quality AI products', has raised a Series B ($80M), and has built a custom database, Brainstore, optimised for AI traces. Patronus AI, previously positioned as an LLM eval platform, pivoted in 2025 to a broader simulation and expert-evaluation focus and now serves regulated workloads that require third-party expert review. Promptfoo remains open-source with eval features still at its core, but it now positions primarily as an AI security testing platform: it headlines 'Build Secure AI Applications', leads with red-teaming, and reports 300,000+ users.

This is not vendor churn for its own sake. The eval-first category dissolved because observability turned out to be the more urgent bottleneck in production. Once teams had AI features live, the question shifted from "how do I evaluate this model?" to "how do I understand what my model is doing right now?" Evals follow naturally from production traces — which is why observability platforms absorbed the eval tooling rather than competing against it.
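To make "evals follow from production traces" concrete, here is a minimal sketch of turning trace records into eval cases. The `Trace` and `EvalCase` schemas and the feedback values are hypothetical, invented for illustration; they do not reflect the data model of LangSmith, Braintrust, or any other platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """Hypothetical production trace record; real platforms use richer schemas."""
    trace_id: str
    prompt: str
    output: str
    user_feedback: Optional[str] = None     # assumed values: "up", "down", or None
    corrected_output: Optional[str] = None  # human fix, if one was recorded


@dataclass
class EvalCase:
    source_trace: str
    input: str
    expected: str


def traces_to_eval_cases(traces: list[Trace]) -> list[EvalCase]:
    """Keep only traces that carry a usable ground-truth signal."""
    cases = []
    for t in traces:
        if t.user_feedback == "down" and t.corrected_output:
            # A failure with a human correction becomes a regression case.
            cases.append(EvalCase(t.trace_id, t.prompt, t.corrected_output))
        elif t.user_feedback == "up":
            # An approved output becomes a golden example.
            cases.append(EvalCase(t.trace_id, t.prompt, t.output))
    return cases
```

Traces with no feedback, or with a thumbs-down but no correction, are dropped here because they offer no expected answer to score against. A hosted platform would typically add sampling, deduplication, and review steps on top of a filter like this.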

Four vendors, four destinations — the eval-platform category in dissolution

What each tool actually is in 2026

Six tools mapped to their real 2026 positioning — not their 2024 descriptions.

The table below reflects May 2026 positioning. When a tool's primary category has changed from 2024, that change is noted explicitly. The 'best fit' column reflects where each tool genuinely wins — not all use cases are equal across the six.

The most important divergence for practitioners: Inspect AI (co-developed by the UK AI Security Institute and Meridian Labs) is the only entry in this table that has stayed consistently positioned as a pure evaluation framework. It has no observability aspirations, no security product to upsell, and no simulation layer. It is evaluation-as-code, stable, open-source, and framework-not-platform.

Tool	Category (2026)	Eval features	Best fit
LangSmith Observability (was: LangChain's eval + observability tool)	AI agent observability platform; evals are a bundled feature, not the headline	Production-trace-to-eval pipelines, dataset management, A/B comparison, regression detection	Teams in the LangChain ecosystem who want production observability with evals alongside
Braintrust (Series B, $80M raised; custom Brainstore database for AI traces)	AI observability platform; moved from pure eval platform positioning	Production traces converted to evals, prompt + model comparison, Brainstore-backed trace storage	Teams wanting an opinionated end-to-end observability + eval platform with strong UX
Promptfoo (open-source; 300k+ users; "Build Secure AI Applications" headline)	AI security testing platform; pivoted from eval framework, red-team-first positioning	Open-source eval framework still core; YAML-driven test definitions; runs locally with no vendor dependency	Teams wanting open-source eval discipline with red-team capabilities bundled
Patronus AI (previously positioned as an LLM eval platform; pivoted in 2025 to simulation + expert evaluation)	World-simulation and expert-evaluation for regulated workloads	Synthetic benchmark generation, expert-in-the-loop scoring, third-party attestation	Regulated workloads needing third-party expert evaluation; narrower scope than 2024 positioning
OpenAI Evals API (the first-party hosted product; the openai/evals GitHub repo, 18.5k stars, remains as an open-source registry but is no longer where active development lives)	First-party hosted eval product within OpenAI's developer platform (API + dashboard UI)	Integrated with OpenAI's model suite; API-driven eval runs with dashboard result viewing	Teams primarily using OpenAI models who want first-party tooling with least-friction integration
Inspect AI (co-developed by UK AI Security Institute + Meridian Labs; open framework, not platform)	Open evaluation framework; consistent positioning, no category drift since launch	Evaluation-as-code, reproducible runs, auditable results, full control over eval logic	Research teams, regulated workloads, and any team wanting reproducible evaluations under version control

Eval-related tools and where they sit in May 2026

Where to start if you have nothing

Open-source eval framework first, observability platform second — reverse order is common but expensive.

If your team has no existing eval infrastructure, the two-stage adoption pattern that holds in practice is: open-source eval framework first, then observability platform once production traces become the bottleneck for eval data.

Start with Inspect AI or Promptfoo. Both are open-source, evaluation-as-code, and reproducible. Neither requires a contract or vendor relationship. They let you establish the eval discipline — dataset management, scoring rubric, regression detection — before you decide what platform to graduate to.
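The three parts of eval discipline named above (a dataset, a scoring rubric, and regression detection) fit in a few lines of plain Python. The sketch below is deliberately framework-free: the dataset contents, the substring rubric, the 0.05 tolerance, and `stub_model` are all illustrative assumptions, not the API of Inspect AI or Promptfoo.

```python
from typing import Callable

# Hypothetical hand-built dataset: (input, term the summary must contain).
DATASET = [
    ("Login fails on mobile Safari after the June update", "login"),
    ("Invoice PDF export times out for large accounts", "invoice"),
]


def score(output: str, required: str) -> float:
    """Toy rubric: 1.0 if the required term appears (case-insensitive), else 0.0."""
    return 1.0 if required.lower() in output.lower() else 0.0


def run_eval(model: Callable[[str], str]) -> float:
    """Return the mean score of `model` over DATASET."""
    scores = [score(model(text), required) for text, required in DATASET]
    return sum(scores) / len(scores)


def has_regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the score drops more than `tolerance` below baseline."""
    return current < baseline - tolerance


def stub_model(text: str) -> str:
    # Stand-in for a real model call; it just echoes the first five words.
    return " ".join(text.split()[:5])
```

In practice you would store the baseline score from the last accepted run and call `has_regressed(run_eval(new_model), baseline)` in CI. Once this loop exists in any form, moving it into a framework is mostly a matter of translating the dataset and the scorer.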

Graduate to an observability platform (LangSmith or Braintrust) when production traces yield more eval-worthy cases than you can curate into datasets by hand. At that point, the trace-to-eval pipeline these platforms offer is genuinely useful rather than a premature abstraction.

Reach for OpenAI Evals API if you're fully OpenAI-native and want zero-friction first-party integration. Reach for Patronus AI only if you have regulated workloads that specifically require third-party expert review.

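# promptfoo eval config (save as promptfooconfig.yaml)
# install: npm install -g promptfoo
# run:     promptfoo eval

prompts:
  - "Summarise this support ticket concisely: {{ticket}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      ticket: "Login fails on mobile Safari after the June update"
    assert:
      # Case-insensitive match, so "Login" and "login" both pass
      - type: icontains
        value: "login"
      # Model-graded check against a plain-language rubric
      - type: llm-rubric
        value: "Does not mention features not present in the ticket"
      # Length cap, approximated in characters rather than tokens
      - type: javascript
        value: "output.length <= 500"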

Promptfoo eval config — YAML-driven, runs locally, no vendor dependency

Production note: the two-stage adoption pattern that holds is an open-source eval framework first (Inspect AI or Promptfoo), then an observability platform (LangSmith or Braintrust) once production traces become your eval bottleneck. The reverse order is common but expensive: you end up paying for observability before you have evals to make sense of the traces.
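The adoption guidance in this section can be written down as a small set of explicit rules, which makes the ordering easy to argue about in a team. This is only a sketch: `TeamProfile`, its field names, and the precedence of the rules are assumptions made for illustration, and a real decision would also weigh pricing and existing contracts.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    """Hypothetical inputs; the field names are illustrative, not from any vendor."""
    has_eval_suite: bool
    traces_are_bottleneck: bool
    openai_only: bool = False
    needs_third_party_review: bool = False


def recommend(profile: TeamProfile) -> str:
    """Encode the adoption order described in this section as explicit rules."""
    if profile.needs_third_party_review:
        # Regulated workloads that require external expert review.
        return "Patronus AI"
    if not profile.has_eval_suite:
        # Stage one: establish eval discipline before buying a platform.
        if profile.openai_only:
            return "OpenAI Evals API"
        return "Inspect AI or Promptfoo"
    if profile.traces_are_bottleneck:
        # Stage two: production traces now drive the eval data.
        return "LangSmith or Braintrust"
    return "Stay on your current open-source framework"
```

For example, `recommend(TeamProfile(has_eval_suite=False, traces_are_bottleneck=False))` returns the open-source starting point, which matches the first stage of the pattern.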

What's volatile

This page is scored partial — three specific signals to watch before the August 2026 revisit.

This sub-page carries a partial score because the cluster is demonstrably in motion. Three specific things to watch between now and August 2026:

First, pricing model changes at Braintrust and LangSmith. Both are Series-B-scale companies under growth pressure, and enterprise tier definitions will shift as both push toward larger accounts.

Second, Patronus AI's positioning settling. The current pivot is roughly twelve months old and has not stabilised into a clear product story.

Third, open-source ecosystem consolidation around Inspect AI, meaning whether it becomes the canonical eval framework the way pytest became the canonical Python test runner.

Honest practitioner posture: pick the tool that fits this quarter based on your current bottleneck — eval discipline or production trace analysis. Don't architect for a vendor relationship that may look different by the time you've shipped.
