How I evaluate an AI chatbot before release
A practical evaluation pass for AI chat features: hallucinations, refusals, prompt injection, and the cases with no single right answer.
Series
Evaluating AI features when there's no single correct answer — and using AI on the test side without fooling yourself. Testing AI breaks the usual playbook: outputs vary, so you test properties instead of equality. This series covers evaluating AI features, reviewing AI-written tests, and where AI genuinely helps a QA workflow versus where it's a trap.
// overview
Testing AI breaks the playbook QA grew up on. There's no single correct output, so “does it equal the expected value?” stops working — and a lot of teams either freeze or wave the feature through. This series is about testing AI features anyway, by checking properties and boundaries instead of exact strings.
It covers both sides of AI in QA: evaluating AI products (hallucinations, refusals, scope, grounded facts), and using AI on the test side without fooling yourself — reviewing AI-written tests that pass for the wrong reasons, and knowing where AI genuinely saves time versus where it's a trap.
The throughline: AI changes what you assert and what you log, not whether you test. The judgement is still the job.
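To make "properties instead of exact strings" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `check_reply` helper, the property names, and the refund example are hypothetical and not taken from any article in the series. The idea is that two differently worded replies can both pass, because the assertions target what must hold for any acceptable answer rather than one expected string.

```python
from dataclasses import dataclass


@dataclass
class PropertyResult:
    name: str
    passed: bool


def check_reply(reply, *, max_chars, forbidden, required_any):
    """Check properties of a reply instead of comparing it to one expected string.

    Hypothetical helper: the property set here is illustrative, not exhaustive.
    """
    lowered = reply.lower()
    return [
        PropertyResult("non_empty", bool(reply.strip())),
        PropertyResult("within_length", len(reply) <= max_chars),
        PropertyResult(
            "no_forbidden_terms",
            not any(term.lower() in lowered for term in forbidden),
        ),
        PropertyResult(
            "mentions_expected_topic",
            any(term.lower() in lowered for term in required_any),
        ),
    ]


# Two differently worded replies to the same (hypothetical) refund question.
replies = [
    "You can request a refund within 30 days of purchase.",
    "Refunds are available for 30 days after you buy.",
]

for reply in replies:
    results = check_reply(
        reply,
        max_chars=300,
        forbidden=["guarantee", "legal advice"],
        required_any=["refund"],
    )
    failed = [r.name for r in results if not r.passed]
    print(reply, "->", failed or "all properties hold")
```

Substring checks like these are deliberately crude. They catch boundary violations cheaply, but deciding whether a reply is actually correct still takes a human reviewer or a stronger oracle.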
// reading order
A practical evaluation pass for AI chat features: hallucinations, refusals, prompt injection, and the cases with no single right answer.
LLMs can't reliably separate instructions from data, so user input can hijack the model. Direct and indirect injection, what to check for, and how to report it in a QA-safe way.
A screenshot isn't a repro when outputs vary. Capture the full assembled prompt, retrieved context, model version, and parameters so an AI bug is actually reproducible.
AI writes plausible Playwright tests that pass for the wrong reasons. Here is the review checklist that catches them.
AI writes 80% of a test 80% of the way, and the remaining 20% is exactly the part that makes it a test. Where AI saves time, where it's a trap, and the distinction that separates the two.
The practical playbook for AI-assisted test writing in 2026. The prompts that work, the prompts that don't, and the human-in-the-loop checkpoints that keep AI from writing tests that pass for the wrong reasons.
Concrete test cases for AI hallucination — unanswerable questions, false premises, invented entities, citations — and how to judge answers with no 'correct' value.
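The bug-report entry in the list above names four things to capture: the full assembled prompt, the retrieved context, the model version, and the parameters. A rough sketch of what such a record could look like follows. The `AIBugRecord` class, its field names, and every value in it (including the model identifier) are hypothetical placeholders, not a format the series prescribes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AIBugRecord:
    assembled_prompt: str   # full prompt as sent: system + user + injected context
    retrieved_context: list  # documents or chunks that were fed to the model
    model_version: str      # exact model identifier (placeholder value below)
    parameters: dict        # sampling settings such as temperature
    observed_output: str    # what the model actually returned


record = AIBugRecord(
    assembled_prompt=(
        "System: answer only from the context.\n"
        "User: What is the refund window?"
    ),
    retrieved_context=["Refunds are accepted within 30 days of purchase."],
    model_version="example-model-2026-01",
    parameters={"temperature": 0.7, "max_tokens": 256},
    observed_output="Refunds are accepted within 90 days.",
)

# Serialise so the record can be attached to a ticket and replayed later.
report = json.dumps(asdict(record), indent=2, sort_keys=True)
print(report)
```

Even with all of this captured, a replay at non-zero temperature may not reproduce the same output. The record narrows the search; it does not guarantee determinism.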