Context Window

AI & LLM Testing

// Definition

The maximum number of tokens (roughly ¾ of a word each in English text) an LLM can consider in a single inference call: the total of the system prompt, conversation history, retrieved documents, and the model's own generated output. When input exceeds the window, the request is either rejected outright or the input is truncated (typically from the start or middle), and truncation can silently drop instructions or facts. QA implications: test behaviour at token counts just below, at, and above the window limit; verify the application chunks or summarises long inputs rather than silently truncating; and confirm truncation does not cause the model to discard critical system-level instructions.
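As an illustration of the budget arithmetic described above, here is a minimal Python sketch. It assumes the rough ¾-word-per-token ratio instead of a real tokenizer, and the names `estimate_tokens`, `fits_window`, and `chunk_words` are hypothetical helpers, not part of any library. A real test suite would likely use the tokenizer that matches the model under test.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate only: a token is about 3/4 of a word, so ~4 tokens per 3 words."""
    words = len(text.split())
    return (words * 4 + 2) // 3  # integer ceiling of words * 4 / 3


def fits_window(parts: list[str], window: int, reserved_output: int) -> bool:
    """True if all input parts plus the reserved output budget fit in the window."""
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_output <= window


def chunk_words(text: str, max_tokens: int) -> list[str]:
    """Split text on whitespace into chunks whose estimate stays within max_tokens."""
    words = text.split()
    words_per_chunk = max(1, max_tokens * 3 // 4)
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
```

Note that the generated output counts against the window too, which is why `fits_window` takes a `reserved_output` budget rather than comparing input size alone.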

// Related terms

Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next token (roughly, the next word or word fragment) in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions, but they don't 'know' anything; they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation, but their output always needs human review before it ships.
Retrieval-Augmented Generation (RAG)
A pattern where an LLM is given relevant context retrieved from an external source (a vector database, a search index, a document store) before being asked to generate an answer. The LLM doesn't 'know' the answer from training — it reads what was retrieved and synthesises a response. RAG is how chatbots answer questions about your company's docs without those docs being baked into the model. From a QA perspective, RAG systems have two failure surfaces: retrieval (did the system find the right context?) and generation (did the LLM use the context faithfully, or did it hallucinate?). Testing must cover both, separately.
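The two failure surfaces can be checked independently. The sketch below is a deliberately naive illustration: `retrieval_hit` and `ungrounded_claims` are hypothetical helpers, document IDs and claims are assumed to be supplied by the test author, and the substring match stands in for whatever faithfulness check a real suite would use.

```python
def retrieval_hit(retrieved_ids: list[str], expected_ids: list[str]) -> bool:
    """Retrieval check: every expected document appears among the retrieved ones."""
    return set(expected_ids) <= set(retrieved_ids)


def ungrounded_claims(answer_claims: list[str], context: str) -> list[str]:
    """Generation check: return claims from the answer that the context never states.

    Naive case-insensitive substring match (an assumption for illustration);
    anything returned is a hallucination candidate for human review.
    """
    lowered = context.lower()
    return [claim for claim in answer_claims if claim.lower() not in lowered]
```

Keeping the checks separate makes a failure easier to attribute: a missed `retrieval_hit` points at the index or query, while a non-empty `ungrounded_claims` result with correct retrieval points at the generation step.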
System Prompt
Instructions sent to an LLM before the conversation begins, used to establish persona, rules, scope, and constraints for the session. Not visible to end users in most product interfaces, but not cryptographically protected — prompt injection and jailbreaking attempt to override or leak it. QA test cases include: does the model follow its instructions under normal conditions? Does it resist attempts to override them? Can an attacker elicit the prompt contents via indirect questions? Are sensitive values (internal instructions, scoped credentials) ever echoed back to the user?
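One way to automate the leak checks listed above is to plant a unique canary string in the system prompt and scan responses for it, or for long verbatim runs of the prompt. This is a sketch under stated assumptions: `leaks_system_prompt`, the canary value, and the six-word run length are all hypothetical choices, and a paraphrased leak would not be caught by verbatim matching.

```python
def leaks_system_prompt(
    response: str,
    system_prompt: str,
    canary: str,
    run_length: int = 6,
) -> bool:
    """True if the response contains the canary or a verbatim run of prompt words."""
    # Normalise case and whitespace so line breaks do not hide a match.
    lowered = " ".join(response.lower().split())
    if canary.lower() in lowered:
        return True
    words = system_prompt.lower().split()
    # Slide a window of run_length words across the prompt and look for each run.
    for i in range(len(words) - run_length + 1):
        if " ".join(words[i:i + run_length]) in lowered:
            return True
    return False
```

A typical use would be to send a batch of direct and indirect extraction attempts to the application and assert that `leaks_system_prompt` is false for every response.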