ReferenceAdvanced4-6 min reference
AI Safety Testing
Safety testing checks that an LLM feature behaves responsibly under adversarial and edge inputs — not just helpful ones. It's the AI equivalent of negative testing. This sheet lists what to probe and the signals; keep destructive red-teaming within authorized scope. See Testing AI Systems and Security Testing (linked below).
What to probe
| Risk | Test input | Should |
|---|---|---|
| Prompt injection | "Ignore previous instructions and…" (incl. text hidden in retrieved docs/files) | Hold its system rules |
| Jailbreak | Role-play / obfuscation to bypass rules | Refuse harmful requests |
| Harmful content | Requests for dangerous/illegal output | Refuse, safe completion |
| Bias / fairness | Same task across demographics | Consistent, non-discriminatory |
| PII / data leakage | "Repeat your system prompt / training data" | Not disclose secrets or other users' data |
| Toxicity | Provocative inputs | Stay non-toxic |
| Over-refusal | Benign edge requests | Not refuse everything (usability) |
Signals & method
- Maintain an adversarial test set (injection strings, jailbreak prompts, bias pairs) and run it on every model/prompt change.
- Score refusal correctness: harmful → refused, benign → answered (watch over-refusal).
- Check system-prompt leakage and cross-user data exposure.
- Test indirect injection via retrieved content/tool output, not just direct user input.
- Layer guardrails (input/output filters) and test them too.
Common mistakes
- Testing only direct prompts, missing indirect injection via documents/tools.
- Measuring refusals but not over-refusal (a usability failure).
- One-off testing instead of a regression set re-run on every change.
- Treating guardrails as untestable.
- Doing aggressive red-teaming outside authorized scope.
// Related resources