Over-Refusal

AI & LLM Testing

// Definition

When an LLM declines to answer a legitimate, benign request because its safety training incorrectly classifies it as harmful. Examples: refusing to explain how a lock mechanism works, declining to write a villain character in fiction, or blocking a security question from a penetration tester. Over-refusal degrades product quality by making the model unreliable for real use cases. A safety test suite must measure both failure directions: harmful outputs (safety failures) and unhelpful refusals (over-refusal). The acceptable operating point trades off between the two.

// Related terms