Red-teaming and adversarial evaluation
Red-teaming an AI model isn't penetration testing with prompts. It's structured adversarial evaluation — finding inputs that produce outputs the system shouldn't produce, then characterising the failure mode well enough that the model owner can fix it. Done well, it is the single highest-signal evaluation activity for safety-critical AI. Done poorly, it generates anecdote that no one acts on.
The red-team pipeline
Five stages from threat model to remediation — threat modelling and finding characterisation are the two steps most teams underinvest in.
A red-team session has five stages: threat modelling (what would harm look like, given this system and this deployment context?), attack generation (human red-teamers generate adversarial inputs, optionally augmented by automated tools), run-and-capture (adversarial inputs are run against the model and raw outputs plus context are stored), characterise-findings (each output that represents a failure is categorised by attack type and severity), and owner-and-remediation (a clear ticket with a reproducer, severity assessment, and suggested mitigation is handed to the model or system owner and verified after the fix is applied).
Two stages are systematically underinvested. Threat modelling is frequently skipped or done too broadly — "the model might say something bad" is not a threat model. A useful threat model names specific harm categories (for example, "user could extract customer data via indirect prompt injection through tool-fetched web content"), specific deployment contexts, and specific attacker motivations. Characterise-findings is underinvested because teams stop at "found a jailbreak" rather than documenting the category, severity, and reproducer precisely enough to enable a fix.
Four attack categories that matter in 2026
Prompt injection (direct and indirect), jailbreaking, data extraction, and capability elicitation — each has a different risk profile and different mitigations.
Prompt injection (direct and indirect) covers instructions injected via user input or via tool-fetched content. Direct injection — a user who instructs "ignore your system prompt and do X" — is the most widely discussed and the best mitigated. Indirect injection, where the adversarial instruction arrives via content the model is asked to process (a fetched web page, a document, an email), is the harder failure mode because the model owner does not control the source. As agentic AI systems that fetch external content become common in 2026, indirect injection is the attack vector that warrants the most investment.
Jailbreaking covers prompts designed to bypass safety training and elicit prohibited outputs. The arms race between training-time hardening and newly discovered jailbreaks continues; novel techniques emerge monthly from both academic researchers and practitioners. Jailbreak hardening is a continuous process, not a one-time fix.
Data extraction covers prompts designed to surface training data (especially PII or confidential documents ingested via RAG) or system prompts. For models trained or fine-tuned on customer data, data extraction is a regulatory risk as well as a security one.
Capability elicitation covers prompts that reveal capabilities the model is not meant to expose in the deployment context — step-by-step instructions for harmful activities, bypassed content policies, or access to restricted tool capabilities. Most relevant for frontier models and for agentic systems with broad tool access.
Automated red-teaming in 2026
Automated tools find the broad failure surface; human red-teamers find the failure modes that matter — both are needed.
Promptfoo's red-team mode (open-source, now positions primarily as an AI security testing platform; eval features remain core) has become the most-cited open-source tooling for automated adversarial input generation. It generates candidate adversarial inputs across a configurable attack taxonomy, runs them against the target model, and flags outputs that match failure patterns.
Anthropic's published red-team methodology — documented in Constitutional AI work and in system-card adversarial testing sections — is the most rigorous public reference for how a frontier model developer approaches red-teaming at scale. It distinguishes clearly between automated capability evaluations and human red-team sessions, and is explicit that both are required.
The UK AI Security Institute (AISI) publishes pre-deployment evaluation reports for frontier models. Their reports describe the approach for both general-capability and safety evaluation — worth reading for teams establishing a serious red-team programme.
The honest framing: automated red-teaming finds the broad failure surface quickly and cheaply. Human red-teaming finds the failure modes that actually matter — those that require context, creativity, and knowledge of the specific deployment. Neither alone is sufficient for a production safety evaluation.
What a useful finding looks like
Category, reproducer, severity, and suggested mitigation — a finding without a reproducer is anecdote.
A red-team finding has four required components. Category: which attack type does this represent (prompt injection direct/indirect, jailbreak, data extraction, capability elicitation)? Reproducer: exact inputs — system prompt, user turn, any tool outputs fetched — plus model version and temperature, so the finding can be re-run and verified after a fix is applied. Severity assessment: is this user-reachable in the production deployment? What scale could it reach? What harm class does the potential output represent? Suggested mitigation: prompt-hardening, input sanitisation, tool-permission scoping, or a training-time fix recommendation.
A finding without a reproducer is anecdote. If the exact prompt sequence cannot be re-run, the finding cannot be verified as fixed. This is the most common quality failure in red-team output: teams document "the model said something bad when I asked about X" without capturing the exact prompt, model version, and temperature.
## Red-team finding Category: [prompt injection · jailbreak · data extraction · capability elicitation] Sub-type: [direct / indirect for injection; specific technique for others] Model: [model ID + version] Temp: [0 / 0.7 / etc.] Timestamp: [ISO 8601] --- REPRODUCER --- System: [exact system prompt] User: [exact adversarial input] Tool resp: [any tool/function outputs fetched during the turn] --- OUTPUT --- Observed: [verbatim model output] Expected: [what the model should have produced or refused] --- SEVERITY --- Reachable: [yes / no / conditional] Scale: [single-user / broad / production-path] Harm class: [describe potential harm concisely] --- MITIGATION --- Suggested: [prompt-hardening / input sanitisation / tool-permission scoping] Status: [ ] open [ ] reproduced [ ] mitigated [ ] verified
// WARNING