Hallucination
// Definition
When an AI model generates output that is fluent and confident but factually wrong or unsupported. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug in the usual sense; they're a consequence of how language models work: predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not just its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
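The "untrusted until checked" rule can be partly automated. The following is a minimal sketch, assuming the method names a model suggested have already been extracted into an array. `findMissingMethods` is a hypothetical helper, `filterMap` is an invented name used for illustration, and `Array.prototype` stands in for whatever real API the generated code targets.

```javascript
// Hypothetical helper: returns the suggested names that are NOT functions on the target.
// `target` is any real object or prototype; `suggested` is a list of names pulled
// from model output (the extraction step is assumed and not shown here).
function findMissingMethods(target, suggested) {
  return suggested.filter((name) => typeof target[name] !== 'function');
}

// Names an assistant might propose for arrays.
// `filterMap` is illustrative: it is not a real Array method.
const suggestedByModel = ['map', 'flatMap', 'filterMap'];
const missing = findMissingMethods(Array.prototype, suggestedByModel);

console.log(missing); // [ 'filterMap' ]
```

A check like this only catches names that don't exist. It says nothing about whether a real method is being used correctly, so running the generated code is still required.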
// Why it matters
Hallucination is the model's most dangerous failure mode because the output looks right. QA can't assert exact strings against a non-deterministic model, so testing shifts to grounding and evaluation: does the answer cite real sources, stay within the provided context, and pass an eval set rather than match a single golden string?
// How to test
// You can't assert exact text on a probabilistic model — assert grounding.
cy.request({ method: 'POST', url: '/api/ask', body: { q: 'What is our refund window?' } })
  .then((res) => {
    // The answer must cite at least one retrieved policy source, not be invented
    expect(res.body.sources, 'cited sources').to.have.length.greaterThan(0)
    // Shape check only: a day count is present. Compare the value against the
    // cited policy text to confirm the number itself is grounded.
    expect(res.body.answer).to.match(/\d+\s*days/)
  })
// Scale this with an eval set + LLM-as-judge, not one-off assertions.
// Common mistakes
- Asserting exact output strings against a non-deterministic model (flaky by design)
- No grounding check — accepting a confident answer with zero sources
- One golden example instead of an eval set across many inputs
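The last two mistakes can be addressed together by scoring grounding across an eval set. The following is a rough sketch, not a definitive harness. The cases, the `fakeAsk` stub, the `policy.md` source name, and the 90% threshold are all hypothetical. A simple deterministic check (at least one source cited, and every number in the answer present in the supplied context) stands in for an LLM-as-judge.

```javascript
// Hypothetical eval set: each case pairs a question with the policy text
// the answer must be grounded in.
const evalSet = [
  { q: 'What is our refund window?', context: 'Refunds are accepted within 30 days of purchase.' },
  { q: 'How long is the warranty?', context: 'Hardware carries a 12 month warranty.' },
  { q: 'When do trials expire?', context: 'Free trials expire after 14 days.' },
];

// Stub standing in for the real endpoint. The warranty answer is deliberately
// ungrounded (wrong number, no sources) so the eval has something to catch.
function fakeAsk(testCase) {
  if (testCase.q.includes('warranty')) {
    return { answer: 'The warranty lasts 24 months.', sources: [] };
  }
  const fact = testCase.context.match(/\d+/)[0];
  return { answer: `It is ${fact} days.`, sources: ['policy.md'] };
}

// Grounded = at least one source cited AND every number in the answer
// also appears as a number in the provided context.
function isGrounded(response, context) {
  const allowed = new Set(context.match(/\d+/g) || []);
  const numbers = response.answer.match(/\d+/g) || [];
  return response.sources.length > 0 && numbers.every((n) => allowed.has(n));
}

// Fraction of cases whose response passes the grounding check.
function passRate(cases, ask) {
  const passed = cases.filter((c) => isGrounded(ask(c), c.context)).length;
  return passed / cases.length;
}

const rate = passRate(evalSet, fakeAsk);
console.log(rate >= 0.9 ? 'eval passed' : `eval failed: ${(rate * 100).toFixed(0)}% grounded`);
```

Asserting on a pass rate rather than on each answer keeps the suite stable when individual responses vary between runs, while still failing when grounding degrades.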
// Related terms
Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next word in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions — but they don't 'know' anything, they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation — but their output always needs human review before it ships.
Prompt Engineering
The craft of writing inputs to AI tools — language models, chat assistants, coding assistants — so that the output is useful, specific, and aligned with the task. Core principles include being specific about format, providing project context (existing patterns, conventions, examples), asking for chain-of-thought reasoning, enumerating edge cases up front, and iterating across multiple turns rather than expecting a perfect first response.
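One way to make those principles repeatable is to build prompts from named parts instead of writing them freehand. The sketch below is illustrative only: `buildTestPrompt` and its field names are hypothetical, and the endpoint description reuses the `/api/ask` example from the test above.

```javascript
// Hypothetical prompt builder: each field maps to one of the principles above
// (specific format, project context, enumerated edge cases, step-by-step reasoning).
function buildTestPrompt({ task, context, format, edgeCases }) {
  return [
    `Task: ${task}`,
    `Project context:\n${context}`,
    `Output format: ${format}`,
    `Edge cases to cover:\n${edgeCases.map((e) => `- ${e}`).join('\n')}`,
    'Explain your reasoning step by step before giving the final answer.',
  ].join('\n\n');
}

const prompt = buildTestPrompt({
  task: 'Write a Cypress test for the refund-window endpoint.',
  context: 'POST /api/ask returns { answer, sources }.',
  format: 'Reasoning first, then one spec file in a code block.',
  edgeCases: ['empty question', 'no sources returned'],
});

console.log(prompt);
```

Keeping the parts separate also makes iteration easier: a follow-up turn can change one field, such as the edge-case list, without rewriting the whole prompt.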