Setting up an internal AI red-team

10 min read · Reviewed May 2026 · emerging practice

External red-teaming is a different problem from what most teams need day-to-day. Frontier labs run expensive external red-team campaigns before model releases; the rest of us need something smaller, faster, and integrated into the sprint cycle. An internal AI red-team — two or three QA engineers, weekly cadence, scope of one feature at a time — is closer to what works in practice. The methodology is covered in the adversarial evaluation sub-page; this sub-page is about the operating model.

READ TIME10 min

DIFFICULTYintermediate

REVIEWEDMay 2026

YOU'LL LEARNPARTIAL — internal red-team practice is emerging. What a small internal QA-led red-team actually does, the cadence that fits a sprint cycle, and the trap of treating red-team findings as regression tests. Score state: partial — internal-red-team operating models are not yet standardised. Last reviewed May 2026; next revisit September 2026.

The internal red-team cycle

A weekly cycle scoped to one feature — scope, attack, triage, fix, then regression at the pattern level.

The cycle that fits a standard two-week sprint runs five stages across one week. Scoping (Monday) pins the specific feature and the three attack categories to exercise this cycle. Active attack generation (Tuesday–Wednesday) is time-boxed to two person-days — the constraint forces selectivity and prevents the session from expanding into an unbounded exercise. Triage (Thursday) produces the finding log with severity classifications. Fix and retest (Friday) closes what can be closed in the sprint; the remainder enters the backlog with risk context. Re-test items enter the regression suite on the following sprint, encoded as pattern-level coverage — not prompt verbatims.

A weekly internal red-team cycle

Scope per cycle

One feature, three attack categories, two person-days — the constraint is the discipline.

"Red-team our AI" is not a scope. "Test the chatbot's refusal behaviour against prompt injection this week" is. The single-feature constraint is not a limitation — it is how the team builds depth rather than producing a shallow pass across everything.

The three-attack-category-per-cycle limit provides a natural forcing function for prioritisation. Rotate categories across cycles to ensure all four are covered quarterly: direct prompt injection, indirect prompt injection, jailbreaking via persona framing, and capability-boundary probing. The four-category taxonomy is covered in detail in the adversarial evaluation sub-page.

Reference frameworks for calibrating severity: MITRE ATLAS provides the adversarial ML attack taxonomy — use it for attack-type classification and to ensure your vocabulary is consistent with industry language. Anthropic RSP v3.2 (April 2026 — note: actively versioned, with v3.0, v3.1, and v3.2 all landing in 2026) defines capability thresholds and corresponding safety commitments that can inform your severity vocabulary even for product features far smaller than a frontier model. OpenAI's Preparedness Framework similarly defines tracked capability categories for systemic-risk calibration.

ai:Red-teaming and adversarial evaluation

Triage criteria

Four severity tiers calibrated to product context — not frontier-capability thresholds.

Finding triage uses a four-field structure: attack category (MITRE ATLAS label or internal equivalent), minimal reproducer (the shortest input sequence that reliably triggers the finding), severity tier, and suggested mitigation. The severity vocabulary is calibrated to product context, not frontier-model risk levels.

Four tiers work in practice. "Could reach a user unintentionally" — no adversarial effort required, the default path surfaces it — is a release blocker. "Real harm at scale" — a plausible user path that could cause material harm across many users — is a next-sprint item. "Reputational harm" — content a reasonable person would object to, but requiring deliberate adversarial effort — is a quarterly priority. "Edge-case only" — requires significant effort to trigger, no realistic user path — is tracked but not scheduled unless the feature is high-risk.

## Finding: [short title]

**Date:** YYYY-MM-DD
**Reviewer:** [name or handle]
**Sprint:** [sprint ID]

### Attack category
<!-- MITRE ATLAS tactic label or internal category -->

### Reproducer
```
[minimal prompt or action sequence that reliably triggers the finding]
```

### Observed behaviour
<!-- What the model actually did -->

### Expected behaviour
<!-- What it should have done -->

### Severity
<!-- reaches-user-unintentionally | real-harm-at-scale | reputational-harm | edge-case-only -->

### Pattern class
<!-- Broader class of inputs this reproducer belongs to — used for regression coverage -->

### Suggested mitigation
<!-- System prompt change, guard rail, training note, or product change -->

### Status
<!-- open | fixed | accepted-risk | deferred -->

Red-team finding template — one file per finding, committed to the repo alongside the test suite

The regression-test trap

Don't add red-team findings to the regression suite verbatim — encode the failure pattern, not the prompt.

The most common failure mode after a red-team cycle: the finding gets fixed, the exact prompt gets added to the regression suite, and the next model variant passes because it was trained on or near that specific input. The regression test now measures nothing — it passes because the particular prompt was seen, not because the underlying capability gap was closed.

The correct approach is to encode the failure pattern as the regression input class, not the specific prompt. If the finding was "model reveals a system prompt when asked to repeat the first word of every sentence", the regression should cover the class of indirect extraction attempts, not that one phrasing. This requires characterising what makes the finding a member of a broader class — more work, but the only coverage that remains valid across model updates.

// WARNING

Don't let red-team findings become regression tests verbatim. The exact prompt that exposed the vulnerability will pass on the next model version. Encode the FAILURE PATTERN as the regression: a class of inputs that exercises the underlying capability gap, not just the one prompt that originally triggered it.

Voluntary commitments as scaffolding

Borrowing severity vocabularies from frontier-model frameworks even when your product is much smaller.

Anthropic RSP v3.2 (April 2026) and OpenAI's Preparedness Framework define capability thresholds and severity tiers for frontier-model red-teaming. These thresholds are calibrated for frontier systems — your product almost certainly does not warrant the same scrutiny for autonomous replication capability. But the vocabulary is useful: the categories they track (CBRN-uplift, cyber offence, persuasion at scale, automated AI development) provide the right frame for distinguishing product-context risk from genuine systemic risk.

Practically: use the frontier vocabulary to justify descoping. If a finding lives far below the RSP capability threshold, document why it is a product-context reputational risk rather than a systemic safety concern. That documentation matters for audit trails and internal risk decisions — it demonstrates that the severity classification was deliberate, not arbitrary.

Related glossary terms

Prompt injection →