AI Model Testing AI Prompts

Create an Evaluation Checklist for an AI Chatbot

Generate a structured evaluation checklist for an AI chatbot — covering accuracy, refusal behaviour, safety, consistency, persona compliance, and human review requirements.

intermediate

Manual QA, QA Lead, SDET

Works with: Claude, ChatGPT, Gemini, Copilot, Cursor

ai-testing, chatbot, evaluation, llm, safety-testing

prompt template

You are a QA engineer specialising in AI system evaluation. Generate a structured evaluation checklist for the AI chatbot described below.

Chatbot name and purpose: {{CHATBOT_PURPOSE}}
Target user base: {{TARGET_USERS}}
Underlying model or platform: {{MODEL_OR_PLATFORM}}
Key capabilities: {{KEY_CAPABILITIES}}
Defined scope (what the chatbot should and should not do): {{SCOPE}}
Safety and content policies: {{SAFETY_POLICIES}}
System prompt summary (high level — do not include confidential prompt content): {{SYSTEM_PROMPT_SUMMARY}}

Generate a checklist with the format:
Check ID | Category | What to evaluate | Test approach | Pass condition | Fail condition

Cover the following categories:

**Accuracy and factual correctness**
- Responses to core use-case queries are accurate and complete
- Verifiable claims hold up when checked against a trusted source
- The chatbot does not present guesses as facts
- The chatbot acknowledges uncertainty when it does not know something

**In-scope behaviour**
- The chatbot correctly handles all documented use cases
- Responses are relevant to the question asked
- Context from earlier in the conversation is used correctly where appropriate
- Long conversations do not cause context drift or forgotten instructions

**Out-of-scope and refusal behaviour**
- The chatbot correctly declines requests outside its defined scope with a helpful explanation
- The chatbot does not fabricate capabilities it does not have
- Over-refusal check: the chatbot does not refuse reasonable, in-scope requests
- Ambiguous requests receive a clarifying question, not a refusal or hallucinated answer

**Safety and content policy compliance**
- The chatbot does not generate harmful, offensive, or policy-violating content
- The chatbot handles sensitive topics (health, legal, financial advice) appropriately — with disclaimers and referrals, not direct advice
- The chatbot does not reveal confidential system prompt content when asked

**Consistency**
- Semantically equivalent questions receive consistent answers
- The chatbot does not contradict itself within a conversation
- Behaviour is consistent across sessions (no random personality shifts)

**Persona and tone compliance**
- Responses match the defined persona, tone, and communication style
- Formatting conventions (lists, headings, length) are applied consistently
- The chatbot correctly identifies itself and does not claim to be human when sincerely asked

**Edge cases**
- Empty input
- Very long input (near context window limit)
- Input in an unexpected language
- Input with intentional spelling errors or unusual formatting

After the checklist, include:
- A note that every AI output requires human review before being used in production decisions
- A recommendation to define a golden dataset of expected input/output pairs for regression testing
- A reminder that this checklist is a starting point — a comprehensive AI safety evaluation requires specialist expertise
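
Outside the prompt itself, the pipe-delimited rows the model returns can be loaded into records so they can be filtered, counted, or exported to a test-management tool. The following is a minimal sketch, assuming the model follows the six-column format requested above; `ChecklistRow` and `parse_checklist` are hypothetical names, not part of any tool mentioned here. Lines that do not fit the format are set aside for a human to inspect rather than silently dropped.

```python
from dataclasses import dataclass

# Column order follows the format line in the prompt template.
COLUMNS = (
    "check_id", "category", "what_to_evaluate",
    "test_approach", "pass_condition", "fail_condition",
)


@dataclass
class ChecklistRow:
    check_id: str
    category: str
    what_to_evaluate: str
    test_approach: str
    pass_condition: str
    fail_condition: str


def parse_checklist(text: str) -> tuple[list[ChecklistRow], list[str]]:
    """Split model output into parsed rows and lines that need a human look."""
    rows: list[ChecklistRow] = []
    rejected: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # prose, blank lines, category headings
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # markdown table separator row such as --- | ---
        if cells[0].lower() == "check id":
            continue  # header row
        if len(cells) != len(COLUMNS) or not all(cells):
            rejected.append(line)  # wrong column count or an empty cell
            continue
        rows.append(ChecklistRow(*cells))
    return rows, rejected
```

A non-empty `rejected` list is a signal to re-prompt or fix the output by hand before the checklist is used.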

Glossary: Large Language Model (LLM), Hallucination, Over-Refusal, Safety Testing (LLM), System Prompt, Context Window, Prompt Engineering, Eval harness, Evaluation Dataset, Non-determinism


Generate Hallucination Test Cases for an AI Feature

Generate a set of test cases designed to surface hallucination and factual fabrication in an AI feature — covering invented facts, false citations, confident incorrect answers, and context confusion.

intermediate

Manual QA, QA Lead, SDET

Works with: Claude, ChatGPT, Gemini, Copilot, Cursor

hallucination, ai-testing, llm, accuracy, evaluation

prompt template

You are a QA engineer specialising in AI evaluation. Generate a set of test cases designed to surface hallucination and factual fabrication in the AI feature described below.

AI feature: {{FEATURE_NAME}}
What it does: {{FEATURE_DESCRIPTION}}
Domain (e.g. product support, code generation, medical, legal, general knowledge): {{DOMAIN}}
Data sources it uses: {{DATA_SOURCES}} (e.g. RAG knowledge base, web search, static training data)
Examples of correct behaviour you have verified: {{CORRECT_EXAMPLES}}

Cover the following hallucination categories:

**Factual fabrication**
- Ask about a specific, verifiable fact in the domain. Verify the answer against a trusted source.
- Ask about a fact that does not exist (e.g. a product feature that was never built, a policy that was never published). The model should acknowledge it does not know or that no such thing exists — not invent a plausible answer.

**False citation and attribution**
- Ask the model to cite its source for a claim. Verify the cited source exists and actually says what the model claims.
- Ask for a quote or reference from a specific document. Verify the quote is accurate and from the stated document.

**Confident incorrect answers**
- Ask questions where the correct answer is "I don't know" or "this is outside my knowledge." A hallucinating model will often give a confident wrong answer.
- Ask questions at the edge of the model's training data or knowledge base scope.

**Outdated information**
- Ask about something that has changed since the model's knowledge cutoff. Verify whether the model signals uncertainty about recency.

**Context confusion**
- In a multi-turn conversation, introduce a fact in one turn, then in a later turn ask a question that invites the model to contradict it. Observe whether it maintains consistency or fabricates a new conflicting fact.
- Use a long context with multiple similar entities (e.g. multiple products, multiple users). Ask about one and verify the model does not confuse attributes between them.

**Specificity pressure**
- Follow up a vague model answer with "be more specific" or "give me an exact number." Observe whether the model accurately signals uncertainty or fabricates specific-sounding details.

For each test case, include a "verification method" column: how a human reviewer should check whether the output is a hallucination (e.g. check against knowledge base, check against documentation, cross-reference against a trusted external source).

After the test cases, include:
- A note that hallucination tests require human review of every output — they cannot be automated with a simple pass/fail assertion
- A recommendation to build a golden dataset of verified inputs and expected outputs for regression testing
- A reminder that no AI system is hallucination-free; the goal is to understand the failure mode and mitigate it
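
For the false-citation tests, part of the reviewer's work can be pre-sorted: a quote the model attributes to a document either appears in that document or it does not. The sketch below is a triage aid under that narrow assumption, not a hallucination detector. The function names are hypothetical, and a `found_verbatim` label still leaves the reviewer to judge whether the quote supports the claim it is attached to.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Lowercase, unify Unicode forms, straighten quotes, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(
        str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
    )
    return re.sub(r"\s+", " ", text).strip()


def quote_appears_in(quote: str, document: str) -> bool:
    """True if the quote occurs verbatim in the document after normalisation."""
    needle = normalise(quote)
    return bool(needle) and needle in normalise(document)


def triage_quotes(quotes: list[str], document: str) -> dict[str, str]:
    """Label each quote for the human reviewer; this does not replace review."""
    return {
        quote: "found_verbatim" if quote_appears_in(quote, document) else "needs_human_check"
        for quote in quotes
    }
```

Paraphrased or lightly reworded quotes land in `needs_human_check` by design, since a string match cannot tell a harmless paraphrase from a fabrication.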

Glossary: Hallucination, Large Language Model (LLM), Retrieval-Augmented Generation (RAG), Evaluation Dataset, Eval harness, Golden dataset, Non-determinism, Context Window, Prompt Engineering


Create a RAG Answer Evaluation Framework

Generate an evaluation framework for a Retrieval-Augmented Generation (RAG) system — covering answer faithfulness, retrieval relevance, completeness, citation accuracy, and refusal correctness.

advanced

SDET, QA Lead, Automation QA

Works with: Claude, ChatGPT, Gemini, Copilot, Cursor

rag, evaluation, llm, ai-testing, faithfulness

prompt template

You are a QA engineer specialising in RAG system evaluation. Design an evaluation framework for the RAG system described below.

System name and purpose: {{SYSTEM_PURPOSE}}
Knowledge base description: {{KNOWLEDGE_BASE}}
Retrieval mechanism: {{RETRIEVAL_MECHANISM}}
Generation model: {{GENERATION_MODEL}}
Typical query types: {{QUERY_TYPES}}
Expected answer format: {{EXPECTED_FORMAT}}

Design an evaluation framework covering the following dimensions:

## 1. Evaluation dimensions
For each dimension, define: what it measures, how to score it (e.g. binary pass/fail or 1–5 scale), and who or what does the scoring (human reviewer, LLM-as-judge, automated check).

**Faithfulness / groundedness**
Does the answer contain only claims supported by the retrieved context? Claims that are not in the retrieved chunks are potential hallucinations.

**Answer relevance**
Does the answer address the question that was asked? A faithful answer may still be irrelevant if it answers a different question.

**Retrieval relevance**
Are the retrieved chunks actually relevant to the query? Retrieval failures upstream cause generation failures downstream.

**Completeness**
Does the answer cover all aspects of the query that the knowledge base can answer? Partial answers may be faithful but incomplete.

**Citation accuracy** (if the system cites sources)
Are cited sources real, correctly referenced, and do they actually support the claim?

**Refusal correctness**
When a query is outside the knowledge base, does the system correctly acknowledge this rather than fabricating an answer?

**Conciseness**
Is the answer appropriately concise for the query type, without padding or irrelevant information?

## 2. Test query categories
Generate 3–5 example test queries for each of the following categories, tailored to {{SYSTEM_PURPOSE}} and {{QUERY_TYPES}}:
- In-scope factual queries (answer exists in knowledge base)
- In-scope multi-hop queries (answer requires synthesising multiple chunks)
- Out-of-scope queries (answer is not in the knowledge base)
- Ambiguous queries (multiple valid interpretations)
- Edge-case queries (near the boundary of the knowledge base coverage)

## 3. Human review protocol
A step-by-step protocol for a human reviewer to evaluate a batch of RAG responses:
1. Read the query and the retrieved context chunks
2. Score retrieval relevance from the chunks alone, before reading the generated answer
3. Read the generated answer and score it on each remaining dimension
4. Flag any answer that scores below threshold for engineering review
5. Document failure modes to feed back into retrieval and prompt tuning

## 4. Regression testing approach
How to build and maintain a golden dataset for regression testing as the knowledge base or model changes.

## 5. Tooling options
Brief overview of open-source RAG evaluation tools available (e.g. RAGAS, DeepEval, TruLens) — noting that all tool outputs still require human review and that this framework does not depend on any specific tool.

After the framework, note: every RAG evaluation requires human oversight — automated metrics are signals, not verdicts.
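
Once a reviewer or judge has scored a response, the step that flags answers scoring below threshold is mechanical. A minimal sketch of that step follows, assuming a 1-5 scale on every dimension; the `RagReview` record, the dimension keys, and the threshold values are hypothetical and would come from whatever the generated framework defines.

```python
from dataclasses import dataclass

# Dimension keys mirror the evaluation dimensions listed in the template.
DIMENSIONS = (
    "faithfulness", "answer_relevance", "retrieval_relevance",
    "completeness", "citation_accuracy", "refusal_correctness", "conciseness",
)


@dataclass
class RagReview:
    query: str
    scores: dict[str, int]   # dimension -> 1..5; omit dimensions that do not apply
    scorer: str = "human"    # e.g. "human", "llm_judge", "automated"
    notes: str = ""

    def __post_init__(self) -> None:
        for name, value in self.scores.items():
            if name not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {name}")
            if not 1 <= value <= 5:
                raise ValueError(f"{name} score {value} is outside 1-5")


def below_threshold(review: RagReview, thresholds: dict[str, int]) -> list[str]:
    """Return the dimensions whose score falls under its threshold."""
    return sorted(
        name
        for name, score in review.scores.items()
        if score < thresholds.get(name, 1)  # no threshold set: never flagged
    )
```

Keeping `scorer` on the record makes it possible to compare human and LLM-as-judge scores for the same query later, which is one way to check how far the automated signal can be trusted.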

Glossary: Retrieval-Augmented Generation (RAG), Hallucination, Evaluation Dataset, Golden dataset, Eval harness, Large Language Model (LLM), Embedding, Context Window, LLM-as-judge, Non-determinism


Create a Golden Dataset for AI Model Testing

Generate a golden dataset design — a curated set of input/expected-output pairs used for regression testing AI features — including curation guidelines, coverage criteria, and maintenance protocol.

advanced

SDET, QA Lead, Automation QA

Works with: Claude, ChatGPT, Gemini, Copilot, Cursor

golden-dataset, evaluation, ai-testing, regression, llm

prompt template

You are a QA engineer designing a golden dataset for AI regression testing. Design a golden dataset for the AI feature described below.

AI feature: {{FEATURE_NAME}}
Feature description: {{FEATURE_DESCRIPTION}}
Input format: {{INPUT_FORMAT}}
Output format: {{OUTPUT_FORMAT}}
Current evaluation gap: {{EVALUATION_GAP}}
Team capacity for curation: {{CURATION_CAPACITY}}

Design a golden dataset covering:

## 1. Dataset scope and coverage criteria
Define the categories of inputs the dataset must cover. For each category, specify the minimum number of examples needed and why.

Categories to include:
- Core happy-path inputs (representative, high-frequency queries the feature should handle well)
- Edge cases (boundary inputs, unusual formats, very short or very long inputs)
- Negative cases (inputs the feature should decline or route elsewhere)
- Adversarial inputs (inputs designed to probe failure modes — e.g. ambiguous phrasing, conflicting context)
- Regression cases (inputs that have previously caused failures — must be included)

Total recommended dataset size: provide a number with justification based on {{CURATION_CAPACITY}}.

## 2. Golden dataset record schema
Define the schema for each record:
- input: the query, prompt, or input to the AI feature
- expected_output: the verified correct output (or output criteria for non-deterministic outputs)
- output_type: exact_match | semantic_match | criteria_checklist | human_review_required
- category: which coverage category this record belongs to
- verification_source: how the expected output was verified (human expert, document reference, etc.)
- added_date: when the record was added
- last_reviewed_date: when it was last checked for accuracy
- tags: relevant tags for filtering (e.g. edge-case, regression, safety)

## 3. Curation guidelines
Rules for creating high-quality golden dataset records:
- How to write inputs that are representative of real user queries
- How to determine expected outputs for non-deterministic AI responses (use criteria checklists, not exact string matches)
- How to handle inputs where the correct output is "I don't know" or a refusal
- How to source and verify expected outputs (cite specific documents, human expert review)
- What makes a poor golden dataset record (too easy, too narrow, not reproducible)

## 4. Evaluation protocol
How to run the golden dataset against the AI feature:
- Automated scoring for exact_match, semantic_match, and criteria_checklist types
- Human review queue for human_review_required types
- Pass threshold definition (e.g. 95% pass rate on exact-match, 80% on semantic-match criteria)
- How to triage failures: regression vs. expected improvement vs. evaluation set error

## 5. Maintenance protocol
How to keep the dataset accurate as the feature evolves:
- Trigger for review: model update, knowledge base update, prompt change, new failure reported in production
- Process for adding new records (production failure → curated dataset record)
- Process for retiring stale records
- Review cadence recommendation

After the design, include:
- A note that all expected outputs must be verified by a human before being added to the golden dataset
- A reminder that golden datasets reflect past failure modes — production monitoring is needed to catch new failure patterns
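
The record schema and the pass-threshold idea from the template translate fairly directly into code. The sketch below is one possible shape, assuming results arrive as (record, passed) pairs from whatever scorer is in use. `GoldenRecord`, `pass_rates`, and `failing_types` are hypothetical names, and the thresholds in the usage lines are the example figures quoted in the template, not recommendations.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date

OUTPUT_TYPES = {"exact_match", "semantic_match", "criteria_checklist", "human_review_required"}


@dataclass
class GoldenRecord:
    input: str
    expected_output: str
    output_type: str
    category: str
    verification_source: str
    added_date: date
    last_reviewed_date: date
    tags: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.output_type not in OUTPUT_TYPES:
            raise ValueError(f"unknown output_type: {self.output_type}")


def pass_rates(results: list[tuple[GoldenRecord, bool]]) -> dict[str, float]:
    """Pass rate per output_type; human_review_required records are skipped
    because they go to the human review queue instead of being auto-scored."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for record, ok in results:
        if record.output_type == "human_review_required":
            continue
        total[record.output_type] += 1
        passed[record.output_type] += int(ok)
    return {output_type: passed[output_type] / total[output_type] for output_type in total}


def failing_types(rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Output types whose pass rate is under the configured threshold."""
    return sorted(t for t, rate in rates.items() if rate < thresholds.get(t, 0.0))


# Usage with the example thresholds from the template (95% and 80%):
# failing_types(pass_rates(results), {"exact_match": 0.95, "semantic_match": 0.80})
```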

Glossary: Golden dataset, Evaluation Dataset, Eval harness, Hallucination, Large Language Model (LLM), Retrieval-Augmented Generation (RAG), Non-determinism, Regression Test, Trajectory evaluation


Model Consistency and Regression Check Prompts

Generate prompts and a methodology for testing AI model consistency — detecting output drift, behavioural regressions, and persona deviations after model updates or prompt changes.

advanced

SDET, Automation QA, QA Lead

Works with: Claude, ChatGPT, Gemini, Copilot, Cursor

ai-testing, regression, consistency, llm, prompt-regression

prompt template

You are a QA engineer designing a consistency and regression testing methodology for an AI feature. Design a testing approach for the AI feature described below.

AI feature: {{FEATURE_NAME}}
Feature description: {{FEATURE_DESCRIPTION}}
What changed: {{WHAT_CHANGED}} (e.g. model version update, system prompt change, knowledge base update)
Behaviour dimensions to verify: {{DIMENSIONS_TO_VERIFY}}
Baseline available: {{BASELINE_AVAILABLE}} (yes/no — do you have golden dataset outputs from before the change?)

Design a consistency and regression testing approach covering:

## 1. Change impact assessment
Before running tests, identify which behaviour dimensions are most likely to be affected by {{WHAT_CHANGED}}. Provide a risk-ranked list with rationale.

## 2. Consistency test methodology

**Same-question consistency**
- Select 20–30 representative inputs from your golden dataset or usage logs
- Run each input 3–5 times against the new version
- Compare outputs for semantic consistency (not exact string match)
- Flag any input where outputs differ significantly across runs (high variance = instability)

**Before/after comparison** (requires baseline)
- Run the same inputs against the old and new versions
- Compare on: factual claims, refusal decisions, tone and persona, response length and format, cited sources (for RAG)
- Document diffs for human review — automated diff tools can highlight changes but cannot judge quality

**Persona and tone regression**
- Run a set of inputs specifically designed to probe persona compliance (politeness, communication style, role adherence)
- Compare outputs to the defined persona specification
- Check for: unexpected formality changes, persona breaks, inconsistent role identification

**Refusal regression**
- Run the refusal test set (inputs that should be declined) against the new version
- Verify: refusals that should remain are still refused; no new unexpected refusals introduced (over-refusal regression)
- Check that refusal messages match the expected format and tone

**Safety regression**
- Run safety probe inputs against the new version
- Verify safety behaviour has not degraded
- Any safety regression is a blocker — escalate immediately

## 3. Scoring and triage
Define a scoring rubric for comparing before/after outputs:
- No regression: output is semantically equivalent and meets quality bar
- Minor drift: output has changed in format or style but remains accurate and on-policy — acceptable with review
- Regression: factual accuracy, refusal behaviour, safety, or persona compliance has degraded — must be fixed before deploy
- Improvement: output is demonstrably better — document as positive change

## 4. Human review protocol
Which categories of output must be reviewed by a human (safety, refusals, claims in regulated domains) and which can be auto-scored.

## 5. Go/no-go criteria
Define the criteria for approving a model or prompt change for deployment:
- Zero safety regressions
- Refusal regression rate below X%
- Factual accuracy regression rate below X%
- No persona compliance failures above threshold

After the methodology, include:
- A note that AI regression testing is inherently probabilistic: results should be reported as pass rates across repeated runs, not as a single binary pass/fail
- A reminder that production monitoring is required alongside pre-deploy regression testing to catch long-tail regressions
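
The go/no-go criteria can be encoded as a small gate so the decision is reproducible from the triage counts. This is a sketch only: the template leaves the X% limits open, so they are required arguments here with no defaults, and `RegressionSummary` and `go_no_go` are hypothetical names. It also assumes, for simplicity, that both rates share one denominator (`total_cases`); a team that runs separate refusal and factual test sets would track a total for each.

```python
from dataclasses import dataclass


@dataclass
class RegressionSummary:
    total_cases: int
    safety_regressions: int
    refusal_regressions: int
    factual_regressions: int
    persona_failures: int


def go_no_go(
    summary: RegressionSummary,
    max_refusal_rate: float,       # the team's chosen limit, as a fraction
    max_factual_rate: float,       # the team's chosen limit, as a fraction
    max_persona_failures: int,     # the team's chosen persona threshold
) -> tuple[bool, list[str]]:
    """Return (approved, reasons). Any listed reason blocks deployment."""
    if summary.total_cases <= 0:
        raise ValueError("total_cases must be positive")
    reasons: list[str] = []
    if summary.safety_regressions > 0:
        reasons.append("safety regression (blocker)")
    # "Below X%" means a rate equal to the limit does not pass.
    if summary.refusal_regressions / summary.total_cases >= max_refusal_rate:
        reasons.append("refusal regression rate at or above limit")
    if summary.factual_regressions / summary.total_cases >= max_factual_rate:
        reasons.append("factual accuracy regression rate at or above limit")
    if summary.persona_failures > max_persona_failures:
        reasons.append("persona compliance failures above threshold")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict gives the human approver something to review, since the gate summarises counts that were themselves produced by human triage.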

Glossary: Large Language Model (LLM), Hallucination, Prompt regression, Evaluation Dataset, Golden dataset, Eval harness, Safety Testing (LLM), Over-Refusal, Non-determinism, System Prompt, Context Window
