External resources for AI testing

OpenAI's open framework for evaluating LLMs. Lets you build and run evals against any model. The closest thing to a standard the field has.

↗ external resource

Ragas

Evaluation framework specifically for RAG systems. Measures faithfulness, answer relevance, context precision and recall — the four metrics that matter most.

↗ external resource

DeepEval

Pytest-style framework for unit-testing LLM applications. Define expectations, run them like normal tests, fail builds on regression.

↗ external resource

HELM (Stanford)

Stanford's Holistic Evaluation of Language Models — a benchmark suite covering accuracy, calibration, robustness, fairness, bias, and toxicity. Reference reading for what good evaluation looks like.

↗ external resource

// Tools

Promptfoo

Open-source tool for testing and comparing LLM prompts. Run the same prompt across models, score outputs, catch regressions in CI.

↗ external resource

garak — LLM vulnerability scanner

NVIDIA's LLM red-team scanner. Probes for prompt injection, jailbreaks, data leakage, and misinformation. Run it against your model the same way you would run a security scanner against a web app.

↗ external resource

LangSmith

Observability and evaluation platform for LLM apps from the LangChain team. Trace runs, build datasets, evaluate over time. Free tier is genuinely usable.

↗ external resource

Phoenix (Arize AI)

Open-source observability for LLM apps. Visualise traces, debug retrieval, monitor hallucination rate. Self-hostable.

↗ external resource

// Blogs

Anthropic — Red-teaming and evaluation

Anthropic's research on red-teaming Claude, evaluation methodology, and safety. Worth reading for anyone testing LLM-backed products in production.

↗ external resource

Prompt injection primer — Simon Willison

The definitive primer on prompt injection. Read this before you let an LLM near user input or external content. Old enough to be canonical, new enough to still apply.

↗ external resource

AI Engineer's Handbook (Chip Huyen)

Chip Huyen's overview of building production LLM systems. Half of it is about evaluation, which means half of it is QA work. Worth reading end to end.

↗ external resource

W&B LLM evaluation guides