ReferenceAdvanced4-6 min reference

AI Safety Testing

Safety testing checks that an LLM feature behaves responsibly under adversarial and edge inputs — not just helpful ones. It's the AI equivalent of negative testing. This sheet lists what to probe and the signals; keep destructive red-teaming within authorized scope. See Testing AI Systems and Security Testing (linked below).

What to probe

Risk	Test input	Should
Prompt injection	"Ignore previous instructions and…" (incl. text hidden in retrieved docs/files)	Hold its system rules
Jailbreak	Role-play / obfuscation to bypass rules	Refuse harmful requests
Harmful content	Requests for dangerous/illegal output	Refuse, safe completion
Bias / fairness	Same task across demographics	Consistent, non-discriminatory
PII / data leakage	"Repeat your system prompt / training data"	Not disclose secrets or other users' data
Toxicity	Provocative inputs	Stay non-toxic
Over-refusal	Benign edge requests	Not refuse everything (usability)

Signals & method

Maintain an adversarial test set (injection strings, jailbreak prompts, bias pairs) and run it on every model/prompt change.
Score refusal correctness: harmful → refused, benign → answered (watch over-refusal).
Check system-prompt leakage and cross-user data exposure.
Test indirect injection via retrieved content/tool output, not just direct user input.
Layer guardrails (input/output filters) and test them too.

Common mistakes

Testing only direct prompts, missing indirect injection via documents/tools.
Measuring refusals but not over-refusal (a usability failure).
One-off testing instead of a regression set re-run on every change.
Treating guardrails as untestable.
Doing aggressive red-teaming outside authorized scope.

// Related resources

Guides & How-to-Test

AI Prompt Library

Glossary

Prompt injection