Eval platforms and tooling
If you went shopping for 'an eval platform' in 2024, you'd compare LangSmith, Braintrust, Patronus, and Promptfoo as roughly peer offerings in the same category. In 2026, that comparison no longer makes sense. LangSmith and Braintrust have moved to observability-first positioning. Patronus AI has pivoted up the stack to world-simulation and expert evaluation. Promptfoo has moved sideways to AI security testing. The pure 'eval platform' category has largely dissolved in two years — and most write-ups haven't caught up. What follows is honest about what each tool actually is in May 2026.
The category shift
Observability absorbed evals: four tools that competed as eval peers in 2024 have each moved to a different category by 2026.
The eval-platform category of 2024 had four reasonably peer offerings: LangSmith (LangChain's evaluation and observability tooling), Braintrust (a purpose-built eval workflow platform), Patronus (an LLM evaluation platform), and Promptfoo (an open-source eval framework). They competed on features, pricing, and integrations within a recognisable category.
By May 2026, each has moved. LangSmith is now LangSmith Observability: an AI agent observability platform where evals are a feature within the observability workflow, not the headline product. Braintrust positions itself as an 'AI observability platform for building quality AI products', has raised a Series B ($80M), and has built a custom 'Brainstore' database optimised for AI traces. Patronus AI, previously positioned as an LLM eval platform, pivoted in 2025 to a broader simulation and expert-evaluation focus and now serves regulated workloads requiring third-party expert review. Promptfoo remains open-source with eval features at its core, but now positions primarily as an AI security testing platform: it headlines 'Build Secure AI Applications', leads with red-teaming, and reports 300,000+ users.
This is not vendor churn for its own sake. The eval-first category dissolved because observability turned out to be the more urgent bottleneck in production. Once teams had AI features live, the question shifted from "how do I evaluate this model?" to "how do I understand what my model is doing right now?" Evals follow naturally from production traces — which is why observability platforms absorbed the eval tooling rather than competing against it.
What each tool actually is in 2026
Six tools mapped to their real 2026 positioning — not their 2024 descriptions.
The table below reflects May 2026 positioning. When a tool's primary category has changed from 2024, that change is noted explicitly. The 'best fit' column reflects where each tool genuinely wins — not all use cases are equal across the six.
The most important divergence for practitioners: Inspect AI (co-developed by the UK AI Security Institute and Meridian Labs) is the only entry in this table that has stayed consistently positioned as a pure evaluation framework. It has no observability aspirations, no security product to upsell, and no simulation layer. It is evaluation-as-code, stable, open-source, and framework-not-platform.
| Tool | Background | Category (2026) | Eval features | Best fit |
|---|---|---|---|---|
| LangSmith Observability | LangSmith Observability (was: LangChain's eval + observability tool) | AI agent observability platform — evals are a bundled feature, not the headline | Production-trace-to-eval pipelines, dataset management, A/B comparison, regression detection | Teams in the LangChain ecosystem who want production observability with evals alongside |
| Braintrust | Braintrust (Series B, $80M raised; custom Brainstore database for AI traces) | AI observability platform — moved from pure eval platform positioning | Production traces converted to evals, prompt + model comparison, Brainstore-backed trace storage | Teams wanting an opinionated end-to-end observability + eval platform with strong UX |
| Promptfoo | Promptfoo (open-source; 300k+ users; "Build Secure AI Applications" headline) | AI security testing platform — pivoted from eval framework; red-team-first positioning | Open-source eval framework still core; YAML-driven test definitions; runs locally with no vendor dependency | Teams wanting open-source eval discipline with red-team capabilities bundled |
| Patronus AI | Patronus AI (previously positioned as an LLM eval platform; pivoted 2025 to simulation + expert evaluation) | World-simulation and expert-evaluation for regulated workloads | Synthetic benchmark generation, expert-in-the-loop scoring, third-party attestation | Regulated workloads needing third-party expert evaluation — narrower scope than 2024 positioning |
| OpenAI Evals API | OpenAI Evals API (the first-party hosted product; the openai/evals GitHub repo, 18.5k stars, remains as an open-source registry but is no longer where active development lives) | First-party hosted eval product within OpenAI's developer platform (API + dashboard UI) | Integrated with OpenAI's model suite; API-driven eval runs with dashboard result viewing | Teams primarily using OpenAI models who want first-party tooling with least-friction integration |
| Inspect AI | Inspect AI (co-developed by UK AI Security Institute + Meridian Labs; open framework, not platform) | Open evaluation framework: consistent positioning, no category drift since launch | Evaluation-as-code, reproducible runs, auditable results, full control over eval logic | Research teams, regulated workloads, and any team wanting reproducible evaluations under version control |
Eval-related tools and where they sit in May 2026
Where to start if you have nothing
Open-source eval framework first, observability platform second — reverse order is common but expensive.
If your team has no existing eval infrastructure, the two-stage adoption pattern that holds in practice is: open-source eval framework first, then observability platform once production traces become the bottleneck for eval data.
Start with Inspect AI or Promptfoo. Both are open-source, evaluation-as-code, and reproducible. Neither requires a contract or vendor relationship. They let you establish the eval discipline — dataset management, scoring rubric, regression detection — before you decide what platform to graduate to.
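To make "eval discipline" concrete, here is a minimal, framework-free sketch of its three parts: a small dataset, a scoring rubric, and a regression check between two versions. It is illustrative only and is not how Inspect AI or Promptfoo implement these ideas. The names `Case`, `score`, `run_eval`, `regressions`, `summarise_v1`, and `summarise_v2` are hypothetical, the second ticket is invented sample data, and the 400-character cap is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One dataset entry: an input plus what the rubric expects."""
    case_id: str
    ticket: str
    must_mention: str  # keyword the summary is expected to contain


def score(case: Case, summary: str, max_chars: int = 400) -> bool:
    """Rubric: mentions the keyword (case-insensitive) and stays short."""
    mentions = case.must_mention.lower() in summary.lower()
    return mentions and len(summary) <= max_chars


def run_eval(cases: List[Case], summarise: Callable[[str], str]) -> Dict[str, bool]:
    """Score every case against one system version."""
    return {c.case_id: score(c, summarise(c.ticket)) for c in cases}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case ids that passed on the baseline but fail on the candidate."""
    return sorted(
        cid for cid, ok in baseline.items()
        if ok and not candidate.get(cid, False)
    )


# Hypothetical stand-ins for two prompt or model versions.
def summarise_v1(ticket: str) -> str:
    return ticket[:80]


def summarise_v2(ticket: str) -> str:
    return "User reports a problem."


CASES = [
    Case("t1", "Login fails on mobile Safari after the June update", "login"),
    Case("t2", "Invoice PDF export times out for large accounts", "invoice"),
]

baseline = run_eval(CASES, summarise_v1)
candidate = run_eval(CASES, summarise_v2)
print(regressions(baseline, candidate))
```

Once a loop like this lives in version control and runs on every prompt change, moving it into a framework is mostly a matter of translating the dataset and rubric into that framework's format.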
Graduate to an observability platform (LangSmith or Braintrust) when your production traces are generating more eval signal than your hand-built datasets can consume. At that point, the trace-to-eval pipeline these platforms offer is genuinely useful rather than premature abstraction.
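The trace-to-eval idea can be sketched without any platform: filter production traces down to the ones worth keeping, deduplicate them, and emit eval cases. The trace fields used here (`input`, `output`, `feedback`) are a hypothetical schema chosen for illustration, not the LangSmith or Braintrust trace format, and `trace_to_case` and `build_dataset` are made-up helper names.

```python
import hashlib
from typing import Dict, Iterable, List


def trace_to_case(trace: Dict[str, str]) -> Dict[str, object]:
    """Turn one trace record into an eval case keyed by a hash of its input."""
    key = hashlib.sha256(trace["input"].encode("utf-8")).hexdigest()[:12]
    return {
        "id": key,
        "vars": {"ticket": trace["input"]},
        "observed_output": trace["output"],
    }


def build_dataset(
    traces: Iterable[Dict[str, str]],
    existing_ids: Iterable[str] = (),
) -> List[Dict[str, object]]:
    """Keep thumbs-down traces whose input is not already in the dataset."""
    seen = set(existing_ids)
    cases = []
    for trace in traces:
        if trace.get("feedback") != "down":
            continue  # only failures flagged by users become eval cases
        case = trace_to_case(trace)
        if case["id"] in seen:
            continue  # skip duplicate inputs
        seen.add(case["id"])
        cases.append(case)
    return cases


# Invented sample traces, for illustration only.
TRACES = [
    {"input": "Login fails on mobile Safari", "output": "Billing issue.", "feedback": "down"},
    {"input": "Login fails on mobile Safari", "output": "Payment problem.", "feedback": "down"},
    {"input": "Cannot reset password", "output": "Password reset fails.", "feedback": "up"},
]

new_cases = build_dataset(TRACES)
print(len(new_cases), new_cases[0]["vars"])
```

When a script like this becomes a recurring chore, or the trace volume outgrows it, that is a reasonable signal that the platform-provided pipeline has started to earn its cost.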
Reach for OpenAI Evals API if you're fully OpenAI-native and want zero-friction first-party integration. Reach for Patronus AI only if you have regulated workloads that specifically require third-party expert review.
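# promptfoo eval config (save as promptfooconfig.yaml)
# install: npm install -g promptfoo
# run: promptfoo eval
prompts:
  - "Summarise this support ticket concisely: {{ticket}}"
providers:
  - openai:gpt-4o
  - anthropic:claude-3-5-sonnet-20241022
tests:
  - vars:
      ticket: "Login fails on mobile Safari after the June update"
    assert:
      # case-insensitive match, so "Login" and "login" both pass
      - type: icontains
        value: "login"
      # model-graded check that the summary adds nothing beyond the ticket
      - type: llm-rubric
        value: "Does not mention features not present in the ticket"
      # length cap: a character-count stand-in for the intended 100-token limit
      # (assumes roughly 4 characters per token)
      - type: javascript
        value: "output.length <= 400"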
What's volatile
This page is scored partial — three specific signals to watch before the August 2026 revisit.
This sub-page carries a partial score because the cluster is demonstrably in motion. There are three things to watch between now and August 2026. First, pricing model changes at Braintrust and LangSmith: both are Series-B-scale companies under growth pressure, and enterprise tier definitions are likely to shift as both push toward larger accounts. Second, whether Patronus AI's positioning settles: the current pivot is roughly twelve months old and has not stabilised into a clear product story. Third, open-source ecosystem consolidation around Inspect AI, specifically whether it becomes the canonical eval framework the way pytest became the canonical Python test runner.
Honest practitioner posture: pick the tool that fits this quarter based on your current bottleneck — eval discipline or production trace analysis. Don't architect for a vendor relationship that may look different by the time you've shipped.