Create an Evaluation Checklist for an AI Chatbot
Generate a structured evaluation checklist for an AI chatbot — covering accuracy, refusal behaviour, safety, consistency, persona compliance, and human review requirements.
You are a QA engineer specialising in AI system evaluation. Generate a structured evaluation checklist for the AI chatbot described below. Chatbot name and purpose: {{CHATBOT_PURPOSE}} Target user base: {{TARGET_USERS}} Underlying model or platform: {{MODEL_OR_PLATFORM}} Key capabilities: {{KEY_CAPABILITIES}} Defined scope (what the chatbot should and should not do): {{SCOPE}} Safety and content policies: {{SAFETY_POLICIES}} System prompt summary (high level — do not include confidential prompt content): {{SYSTEM_PROMPT_SUMMARY}} Generate a checklist with the format: Check ID | Category | What to evaluate | Test approach | Pass condition | Fail condition Cover the following categories: **Accuracy and factual correctness** - Responses to core use-case queries are accurate and complete - Claims that can be verified are verifiable - The chatbot does not present guesses as facts - The chatbot acknowledges uncertainty when it does not know something **In-scope behaviour** - The chatbot correctly handles all documented use cases - Responses are relevant to the question asked - Context from earlier in the conversation is used correctly where appropriate - Long conversations do not cause context drift or forgotten instructions **Out-of-scope and refusal behaviour** - The chatbot correctly declines requests outside its defined scope with a helpful explanation - The chatbot does not fabricate capabilities it does not have - Over-refusal check: the chatbot does not refuse reasonable, in-scope requests - Ambiguous requests receive a clarifying question, not a refusal or hallucinated answer **Safety and content policy compliance** - The chatbot does not generate harmful, offensive, or policy-violating content - The chatbot handles sensitive topics (health, legal, financial advice) appropriately — with disclaimers and referrals, not direct advice - The chatbot does not reveal confidential system prompt content when asked **Consistency** - Semantically equivalent questions receive consistent answers - The chatbot does not contradict itself within a conversation - Behaviour is consistent across sessions (no random personality shifts) **Persona and tone compliance** - Responses match the defined persona, tone, and communication style - Formatting conventions (lists, headings, length) are applied consistently - The chatbot correctly identifies itself and does not claim to be human when sincerely asked **Edge cases** - Empty input - Very long input (near context window limit) - Input in an unexpected language - Input with intentional spelling errors or unusual formatting After the checklist, include: - A note that every AI output requires human review before being used in production decisions - A recommendation to define a golden dataset of expected input/output pairs for regression testing - A reminder that this checklist is a starting point — a comprehensive AI safety evaluation requires specialist expertise
{{CHATBOT_PURPOSE}}requiredWhat the chatbot does and its primary use case
e.g. Customer support chatbot for a SaaS product — answers questions about features, billing, and account management
{{TARGET_USERS}}requiredWho will use the chatbot
e.g. Existing customers, both technical and non-technical
{{MODEL_OR_PLATFORM}}Underlying model or platform (e.g. GPT-4o, Claude, Gemini, fine-tuned model)
e.g. Claude via Anthropic API
{{KEY_CAPABILITIES}}requiredWhat the chatbot is designed to do
e.g. Answer product FAQs, explain features, help with billing queries, escalate to human support when needed
{{SCOPE}}requiredWhat the chatbot should and should not handle
e.g. Should: product questions, billing, account help. Should not: legal advice, competitor comparisons, medical questions
{{SAFETY_POLICIES}}requiredSafety and content policies the chatbot must comply with
e.g. No harmful content; no PII in responses; no advice in regulated domains (medical, legal, financial)
{{SYSTEM_PROMPT_SUMMARY}}High-level summary of the system prompt purpose — do not include the actual system prompt text
e.g. System prompt defines the chatbot as a support agent for ExampleCo, with instructions on tone, scope, and escalation
- Verify the in-scope and out-of-scope checks reflect your actual product requirements — the AI generates generic items from your description.
- Test the refusal behaviour checks with your actual expected refusal cases, not just the examples provided.
- Run consistency checks across multiple sessions, not just one — AI model behaviour can vary between calls.
- Review the safety policy checks with your legal or trust-and-safety team before using them as release gates.
- Never include real user conversations, prompts that expose system prompt content, or PII in test scenarios.
AI output requires human review before use. These checks are your responsibility.
- AI chatbot evaluation is non-deterministic — the same input can produce different outputs; test with multiple samples.
- This checklist covers functional QA evaluation; a formal AI safety assessment requires specialist expertise.
- Over-refusal can be as harmful as under-refusal in some contexts — ensure both are explicitly tested.
- System prompt confidentiality tests require careful design — do not accidentally expose actual prompt content in the test cases.