Non-determinism
// Definition
Behaviour where the same input doesn't always produce the same output. In classical testing this is the cause of flaky tests (race conditions, time-of-day bugs, unstable networks), and the response is to hunt down the source and eliminate it. In AI-backed systems, non-determinism is intrinsic to the model itself: an LLM sampled at a non-zero temperature can give different answers to the same prompt, by design. The QA implication is that the classical tactic of eliminating variance doesn't work; you have to measure variance instead. Tolerance budgets, score distributions, and agreement-rate metrics replace pass/fail counts for the AI parts of a system, while the deterministic plumbing around them (auth, routing, database writes) keeps its classical test treatment.
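As a rough illustration of measuring variance rather than eliminating it, the sketch below computes an agreement rate over repeated responses to one prompt and checks it against a tolerance budget. The function names, the 0.8 threshold, and the sample labels are all assumptions made for this example, not part of any particular framework.

```python
from collections import Counter


def agreement_rate(outputs):
    """Fraction of outputs that match the most common output."""
    if not outputs:
        raise ValueError("need at least one output")
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / len(outputs)


def within_budget(outputs, min_agreement=0.8):
    """True if the agreement rate meets the tolerance budget.

    The 0.8 default is an arbitrary, hypothetical budget.
    """
    return agreement_rate(outputs) >= min_agreement


# Ten hypothetical responses to the same prompt, normalised to a label.
samples = ["refund"] * 9 + ["escalate"]

print(agreement_rate(samples))  # 0.9
print(within_budget(samples))   # True
```

In practice the outputs would come from repeated calls to the model under test, and free-text answers usually need normalising (for example to a label or category) before agreement between them is meaningful.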
// Related terms
Deterministic vs probabilistic testing
Traditional software tests are deterministic: same input, same output, pass or fail. AI-backed features are probabilistic: the same input can give different outputs, and "correctness" is a distribution rather than a binary. This isn't a small distinction; it breaks most of the assumptions baked into existing test frameworks. Exact-match assertions stop being useful. Flaky-test detection logic flags real model variance as a bug. The unit of measurement shifts from "this test passed" to "this prompt scored 0.87 on average across the eval set, up from 0.83 last week." Senior testers working on AI features spend more time defining what correctness means for a given feature than they do writing assertions.
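The shift from exact-match assertions to scored distributions can be sketched as follows. This is a minimal example under stated assumptions: `keyword_score` is a deliberately crude, hypothetical scorer, and the baseline mean and allowed drop are invented numbers. Real evaluations typically use richer scoring, but the shape of the check is the same.

```python
from statistics import mean, stdev


def keyword_score(output, required_keywords):
    """Hypothetical scorer: fraction of required keywords found in the output."""
    text = output.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def evaluate(outputs, required_keywords, baseline_mean, max_drop=0.05):
    """Score every output and compare the mean against a baseline.

    The run passes if the mean score has not dropped more than
    max_drop below baseline_mean (both values are assumptions).
    """
    scores = [keyword_score(o, required_keywords) for o in outputs]
    avg = mean(scores)
    return {
        "mean": avg,
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "passed": avg >= baseline_mean - max_drop,
    }


# Three hypothetical responses to the same prompt.
outputs = [
    "Your refund will arrive in 5 days.",
    "We will refund you within 5 days.",
    "Please contact support.",
]

# An exact-match assertion would fail even though both answers are acceptable.
print(outputs[0] == outputs[1])  # False

result = evaluate(outputs, ["refund", "5 days"], baseline_mean=0.7)
print(result["passed"])  # True
```

Tracking the mean and spread over time, rather than a single pass or fail, is what makes a statement like "up from 0.83 last week" possible.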
Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next token (roughly a word or word fragment) in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions, but they don't 'know' anything: they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation, but their output always needs human review before it ships.
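To make "predicting plausible continuations" concrete, here is a toy sketch of temperature-based sampling over three candidate tokens. The tokens and their scores (logits) are invented for illustration, and this is not how any specific model is implemented; it only shows why a non-zero temperature can yield different outputs for the same input.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens, logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, sampled otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores for the token after "The capital of France is".
tokens = ["Paris", "Lyon", "London"]
logits = [4.0, 2.0, 1.0]
rng = random.Random(0)  # seeded so a run is repeatable

greedy = {sample_next_token(tokens, logits, 0, rng) for _ in range(100)}
sampled = {sample_next_token(tokens, logits, 1.5, rng) for _ in range(100)}

print(greedy)           # {'Paris'}
print(sorted(sampled))  # usually more than one distinct token
```

The greedy branch always returns the highest-scoring token, while the sampled branch can return lower-probability tokens too, which is the mechanism behind both output variance and the occasional plausible-sounding wrong answer.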