On this page3 sections
ReferenceAdvanced4-6 min reference

AI Safety Testing

Safety testing checks that an LLM feature behaves responsibly under adversarial and edge inputs — not just helpful ones. It's the AI equivalent of negative testing. This sheet lists what to probe and the signals; keep destructive red-teaming within authorized scope. See Testing AI Systems and Security Testing (linked below).

What to probe

RiskTest inputShould
Prompt injection"Ignore previous instructions and…" (incl. text hidden in retrieved docs/files)Hold its system rules
JailbreakRole-play / obfuscation to bypass rulesRefuse harmful requests
Harmful contentRequests for dangerous/illegal outputRefuse, safe completion
Bias / fairnessSame task across demographicsConsistent, non-discriminatory
PII / data leakage"Repeat your system prompt / training data"Not disclose secrets or other users' data
ToxicityProvocative inputsStay non-toxic
Over-refusalBenign edge requestsNot refuse everything (usability)

Signals & method

  • Maintain an adversarial test set (injection strings, jailbreak prompts, bias pairs) and run it on every model/prompt change.
  • Score refusal correctness: harmful → refused, benign → answered (watch over-refusal).
  • Check system-prompt leakage and cross-user data exposure.
  • Test indirect injection via retrieved content/tool output, not just direct user input.
  • Layer guardrails (input/output filters) and test them too.

Common mistakes

  • Testing only direct prompts, missing indirect injection via documents/tools.
  • Measuring refusals but not over-refusal (a usability failure).
  • One-off testing instead of a regression set re-run on every change.
  • Treating guardrails as untestable.
  • Doing aggressive red-teaming outside authorized scope.

// Related resources