External resources for AI testing
A curated, dated link directory of high-quality external resources for testing AI products. Reviewed quarterly. The selection criterion is concrete usefulness for a working QA engineer — not brand recognition or novelty.
// Guides
OWASP LLM Top 10
Guide: The canonical list of security risks for LLM-backed applications. Prompt injection, training data poisoning, sensitive information disclosure, and more.
Anthropic prompt engineering guide
Guide: Official guide for writing prompts that work. Useful for QA because you will be writing prompts to generate test cases, eval rubrics, and red-team probes.
// Frameworks
OpenAI Evals
Framework: OpenAI's open-source framework for evaluating LLMs. Lets you build and run evals against OpenAI models and, through custom completion functions, other models. The closest thing to a standard the field has.
Ragas
Framework: Evaluation framework specifically for RAG systems. Measures faithfulness, answer relevancy, context precision, and context recall, the four metrics that matter most when checking whether retrieval and generation agree.
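To make context precision and recall concrete, here is a minimal sketch that scores retrieval against hand-labelled chunk IDs. It is a set-based simplification and not Ragas's implementation: Ragas derives its scores from LLM judgments, and the function names and sample IDs below are hypothetical.

```python
from typing import Iterable, Set


def context_precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunk IDs that are labelled relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0  # nothing retrieved, so treat precision as zero
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of relevant chunk IDs that the retriever returned."""
    if not relevant:
        return 0.0  # no labelled ground truth, so recall is undefined; report zero
    retrieved_set = set(retrieved)
    hits = sum(1 for chunk_id in relevant if chunk_id in retrieved_set)
    return hits / len(relevant)


# Hypothetical labelled example: four chunks retrieved, three are truly relevant.
retrieved_ids = ["c1", "c2", "c3", "c4"]
relevant_ids = {"c1", "c3", "c7"}

precision = context_precision(retrieved_ids, relevant_ids)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved_ids, relevant_ids)        # 2 of 3 relevant were retrieved
```

Low precision points at a noisy retriever; low recall points at missing context, which tends to show up downstream as unfaithful answers.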
DeepEval
Framework: Pytest-style framework for unit-testing LLM applications. Define expectations, run them like normal tests, fail builds on regression.
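The pytest-style pattern can be sketched without any framework. The example below uses plain assertions and does not use DeepEval's API; `answer_question` is a hypothetical stand-in for the application under test, and `keyword_coverage` is an assumed, deliberately crude metric.

```python
from typing import List


def answer_question(question: str) -> str:
    """Hypothetical stand-in for the LLM application under test."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days of purchase.",
    }
    return canned.get(question, "I don't know.")


def keyword_coverage(output: str, required: List[str]) -> float:
    """Share of required phrases that appear in the output (case-insensitive)."""
    lowered = output.lower()
    found = sum(1 for phrase in required if phrase.lower() in lowered)
    return found / len(required)


def test_refund_answer_mentions_key_facts():
    output = answer_question("What is the refund window?")
    score = keyword_coverage(output, ["30 days", "refund"])
    # A threshold rather than an exact match tolerates harmless rewording.
    assert score >= 0.8, f"coverage too low: {score:.2f}"


def test_unknown_question_is_declined():
    output = answer_question("What is the CEO's home address?")
    assert "don't know" in output.lower()
```

Saved as a `test_*.py` file, this runs under `pytest`, and a failing assertion fails the build like any other unit test. A real setup would replace the canned answers with a call to the application and the keyword check with a stronger metric.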
HELM (Stanford)
Framework: Stanford's Holistic Evaluation of Language Models, a benchmark suite covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Reference reading for what good evaluation looks like.
// Tools
Promptfoo
Tool: Open-source tool for testing and comparing LLM prompts. Run the same prompt across models, score outputs, catch regressions in CI.
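The compare-and-catch-regressions workflow can be sketched in a few lines. This is not Promptfoo's configuration format or API; the two model functions are hypothetical stubs, and the substring scorer and baseline values are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def model_a(prompt: str) -> str:
    """Hypothetical stub standing in for a real model call."""
    return "Paris is the capital of France."


def model_b(prompt: str) -> str:
    """Hypothetical stub that answers incorrectly."""
    return "I think it might be Lyon."


def contains_score(output: str, expected: str) -> float:
    """1.0 if the expected substring appears in the output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_matrix(prompt: str, expected: str,
               models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Send the same prompt to every model and score each output."""
    return {name: contains_score(fn(prompt), expected) for name, fn in models.items()}


def find_regressions(scores: Dict[str, float], baseline: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """Names of models whose score fell below their recorded baseline."""
    return sorted(name for name, score in scores.items()
                  if score < baseline.get(name, 0.0) - tolerance)


MODELS = {"model_a": model_a, "model_b": model_b}
BASELINE = {"model_a": 1.0, "model_b": 1.0}  # assumed scores from a previous run

scores = run_matrix("What is the capital of France?", "Paris", MODELS)
regressions = find_regressions(scores, BASELINE)
```

In CI, a non-empty `regressions` list would be turned into a non-zero exit code so the pipeline fails.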
garak — LLM vulnerability scanner
Tool: NVIDIA's LLM red-team scanner. Probes for prompt injection, jailbreaks, data leakage, and misinformation. Run it against your model the same way you would run a security scanner against a web app.
LangSmith
Tool: Observability and evaluation platform for LLM apps from the LangChain team. Trace runs, build datasets, evaluate over time. Free tier is genuinely usable.
Phoenix (Arize AI)
Tool: Open-source observability for LLM apps. Visualise traces, debug retrieval, monitor hallucination rate. Self-hostable.
// Blogs
Anthropic — Red-teaming and evaluation
Blog: Anthropic's research on red-teaming Claude, evaluation methodology, and safety. Worth reading for anyone testing LLM-backed products in production.
Prompt injection primer — Simon Willison
Blog: The definitive primer on prompt injection. Read this before you let an LLM near user input or external content. Old enough to be canonical, new enough to still apply.
AI Engineer's Handbook (Chip Huyen)
Blog: Chip Huyen's overview of building production LLM systems. Half of it is about evaluation, which means half of it is QA work. Worth reading end to end.
W&B LLM evaluation guides
Blog: Weights & Biases' practical articles on LLM evaluation patterns. Concrete enough to copy, opinionated enough to follow.