Edge case discovery with AI

9 min read · Reviewed May 2026 · edge cases

Humans write the happy path and three or four obvious negatives. AI generates fifty boundary inputs you would never think of: a null character mid-string, right-to-left Arabic in a left-to-right form, leap-year date arithmetic, integer overflow at the API boundary. The work is then sorting useful adversarial inputs from noise. That sorting discipline separates a team that runs fifty AI-generated edge cases and finds three genuine defects from a team that runs the same fifty and finds none because the inputs never reached the application logic.

DIFFICULTY: intermediate

YOU'LL LEARN: Six categories of edge case that AI generates well, and the review discipline that distinguishes useful adversarial inputs from noise.

Six categories of edge case

Category-specific prompting distributes coverage across input-space dimensions AI handles well.

Six categories of edge case map well to LLM generation capabilities: boundary values, null and missing fields, unicode and encoding anomalies, locale-specific formats, race conditions and timing, and adversarial inputs. Each category represents a distinct failure mode that manual case design commonly misses — either because the tester does not think of it, or because generating exhaustive boundary values for a single field takes longer than the time available.

AI generates coverage across all six categories in seconds. The quality is uneven — adversarial inputs tend to be more creative than necessary, boundary values tend to be more thorough than a human would produce — but the overall set represents a coverage floor that manual case design rarely achieves in comparable time.

Six edge-case categories AI generates well
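As a rough illustration of what a few inputs per category might look like, the sketch below lists hypothetical samples in Python. Every value, along with the names `EDGE_CASE_SAMPLES` and `count_per_category`, is an assumption made for this example and not output from any particular model.

```python
# Hypothetical sample inputs for each of the six categories.
# All values are illustrative assumptions, not generated output.
EDGE_CASE_SAMPLES = {
    # Empty, exactly at a 255-char limit, one past it, and 32-bit int edges.
    "boundary": ["", "a" * 255, "a" * 256, str(2**31 - 1), str(2**31)],
    # Missing value, whitespace-only, and a null character mid-string.
    "null_and_missing": [None, "   ", "abc\x00def"],
    # Combining accent, precomposed accent, emoji, and Arabic (RTL) text.
    "unicode_and_encoding": ["e\u0301", "\u00e9", "\U0001F600", "\u0645\u0631\u062d\u0628\u0627"],
    # Day-first vs month-first dates, decimal comma, leap day.
    "locale_formats": ["31/12/2024", "12/31/2024", "1.234,56", "2024-02-29"],
    # Timing cases are scenarios, not strings, so they are written as descriptions.
    "race_and_timing": ["same form submitted twice in quick succession",
                        "session token expires mid-request"],
    # Classic injection fragments.
    "adversarial": ["' OR '1'='1", "<script>alert(1)</script>", "a@b.io\r\nBcc: x@y.io"],
}


def count_per_category(samples):
    """Return how many sample inputs each category holds."""
    return {category: len(values) for category, values in samples.items()}


if __name__ == "__main__":
    for category, count in count_per_category(EDGE_CASE_SAMPLES).items():
        print(f"{category}: {count}")
```

Note that the race-and-timing entries are scenario descriptions, which hints at why that category is harder to express as plain field inputs than the other five.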

How to ask for them

Category-specified prompts produce distributed coverage; unspecified prompts produce a single bucket.

The quality difference between a naive edge-case prompt and a category-specified prompt is substantial. A naive prompt — "give me edge cases for this field" — produces a list that clusters around the most obvious boundaries and ignores the locale, unicode, and race-condition categories entirely. The LLM treats "edge case" as a single concept and optimises for surface coverage.

Specifying categories explicitly distributes generation across the full input space. The model generates against each category label rather than guessing what you consider an edge case, and the resulting set covers dimensions that a naive prompt would miss entirely. The prompt pattern below exercises five of the six categories with a prescribed distribution; race conditions and timing are left out because they describe the ordering of requests rather than a value typed into a single field.

# Edge case generation — category-specified prompt
# Field: email VARCHAR(255), required, must be a valid RFC 5321 address

Give me 20 test inputs for this email field.
Distribute them across these specific categories:

Boundary (5 inputs): empty string, max-length (255 chars),
  exactly valid minimal address (a@b.io), near-boundary lengths

Null & missing (3 inputs): null, undefined, whitespace-only string

Unicode & encoding (5 inputs): non-ASCII local part, RTL script,
  emoji in local part, combining diacritics, surrogate pair

Locale (4 inputs): Japanese full-width chars, German umlaut,
  French accents, Arabic characters

Adversarial (3 inputs): SQL injection fragment, XSS payload,
  SMTP injection via newlines

Format: one input per line, category label as prefix.
Include expected validation result (valid/invalid). No commentary.

Category-specified edge case prompt: produces distributed coverage across five of the six categories
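If the same prompt shape is reused across many fields, it can help to assemble it from data so the per-category counts always add up to the requested total. The following is a minimal sketch under that assumption; `Category`, `build_prompt`, and `EMAIL_CATEGORIES` are hypothetical names, and the output is only the prompt text, with no call to any model API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    name: str    # label the model should prefix each input with
    count: int   # how many inputs to request for this category
    hints: str   # examples that steer generation within the category


def build_prompt(field_description, categories):
    """Assemble a category-specified edge-case prompt as plain text."""
    total = sum(c.count for c in categories)
    lines = [
        f"# Field: {field_description}",
        "",
        f"Give me {total} test inputs for this field.",
        "Distribute them across these specific categories:",
        "",
    ]
    for c in categories:
        lines.append(f"{c.name} ({c.count} inputs): {c.hints}")
        lines.append("")
    lines.append("Format: one input per line, category label as prefix.")
    lines.append("Include expected validation result (valid/invalid). No commentary.")
    return "\n".join(lines)


# Mirrors the distribution used in the email prompt above.
EMAIL_CATEGORIES = [
    Category("Boundary", 5, "empty string, max-length (255 chars), minimal valid address"),
    Category("Null & missing", 3, "null, undefined, whitespace-only string"),
    Category("Unicode & encoding", 5, "non-ASCII local part, RTL script, emoji"),
    Category("Locale", 4, "Japanese full-width chars, German umlaut, French accents"),
    Category("Adversarial", 3, "SQL injection fragment, XSS payload, SMTP injection"),
]

if __name__ == "__main__":
    print(build_prompt("email VARCHAR(255), required", EMAIL_CATEGORIES))
```

Deriving the total from the category counts avoids the small but common mistake of asking for 20 inputs while the listed categories sum to some other number.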

The signal-to-noise problem

Sixty to eighty percent useful — treat AI-generated edge cases as a starting set, not a final test suite.

A typical AI-generated edge-case set for a complex field contains roughly fifty inputs. Around thirty will be genuine test candidates that reach your application logic. Approximately fifteen will be invalid before they reach any code, because field validation rejects them at the boundary. The remaining five or so will be duplicates with different surface forms: for example, an SQL injection fragment and an XSS payload look different, but if both are handled by the same application-layer sanitisation routine they exercise the same path.

The two-pass review process handles this efficiently. Run the generated set through schema validation first — automated, takes seconds, discards the fifteen invalid inputs without human review. Then scan the remaining thirty-five for duplicates and clustering. The human review step is proportionate to the value it adds: evaluating thirty-five inputs for genuine coverage is a ten-minute task, not a morning.
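The two passes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed tool: `passes_schema` stands in for whatever schema validation the field really has (here only a type check and the 255-character limit from the email example), and `dedupe_key` is a hypothetical normalisation used to flag likely duplicates for the human reviewer without deleting them.

```python
import unicodedata

MAX_LENGTH = 255  # assumption taken from the VARCHAR(255) example


def passes_schema(value):
    """Pass 1 stand-in: keep only non-empty strings within the length limit."""
    return isinstance(value, str) and 0 < len(value) <= MAX_LENGTH


def dedupe_key(value):
    """Normalise surface differences (composition, case, outer whitespace)."""
    return unicodedata.normalize("NFC", value).strip().casefold()


def two_pass_review(candidates):
    """Return (survivors, clusters) for human review.

    survivors: inputs that passed the automated schema check.
    clusters:  groups of survivors sharing a normalised form, which a human
               should inspect as possible duplicates.
    """
    survivors = [c for c in candidates if passes_schema(c)]
    groups = {}
    for value in survivors:
        groups.setdefault(dedupe_key(value), []).append(value)
    clusters = [group for group in groups.values() if len(group) > 1]
    return survivors, clusters


if __name__ == "__main__":
    generated = [None, "", "a" * 300, "A@B.io", "a@b.io ", "x@y.io"]
    kept, possible_duplicates = two_pass_review(generated)
    print(len(kept), "survivors;", len(possible_duplicates), "cluster(s) to inspect")
```

The second pass only flags clusters instead of dropping them, since two inputs that normalise to the same form (a combining accent versus a precomposed one, say) may still be distinct and deliberate test cases.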

// NOTE

AI-generated edge cases are a starting set, not a final test suite. Treat them like a code review: sixty to eighty percent useful, the rest discarded or refined. Schema validation before human review eliminates the lowest-value inputs automatically.

Categories AI handles badly

Domain-specific rules, state machines, and performance boundaries require human case design.

AI generates input-space coverage well. It generates state-space coverage badly. The three categories where AI-generated edge cases are consistently low-quality are domain-specific business rules (the model does not know what your application does), multi-step state machines (the model cannot reason deeply about valid state transitions), and performance and infrastructure boundaries (the model has no knowledge of your infrastructure constraints or production load profiles).

For these categories, the pattern that works is AI-assisted case design rather than AI-generated cases. Describe the business rule or state machine to the model and ask it to identify which transitions are underspecified or ambiguous — then a human designs the test cases against those ambiguities. The model is useful as a sounding board for incomplete specifications; it is not useful as a generator of test cases for business logic it cannot observe.
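One mechanical aid for that conversation is to enumerate the state and event pairs a specification never mentions, then hand the list to the model or a reviewer as the starting point for the ambiguity discussion. The sketch below assumes a made-up order workflow; the states, events, transition table, and the function name `unspecified_transitions` are all hypothetical.

```python
from itertools import product

# Hypothetical order workflow used only for illustration.
STATES = ["draft", "submitted", "paid", "cancelled"]
EVENTS = ["submit", "pay", "cancel"]

# Transitions the (imaginary) specification defines explicitly.
TRANSITIONS = {
    ("draft", "submit"): "submitted",
    ("draft", "cancel"): "cancelled",
    ("submitted", "pay"): "paid",
    ("submitted", "cancel"): "cancelled",
}


def unspecified_transitions(states, events, transitions):
    """List every (state, event) pair the specification does not define."""
    return [pair for pair in product(states, events) if pair not in transitions]


if __name__ == "__main__":
    for state, event in unspecified_transitions(STATES, EVENTS, TRANSITIONS):
        # Each gap is a question for a human: reject, ignore, or missing rule?
        print(f"What should happen on '{event}' while in '{state}'?")
```

The output is a list of questions, not test cases. Deciding whether cancelling a paid order is an error, a refund, or a no-op remains a business decision, and the test cases are designed by a human once that decision is made.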

// PRODUCTION

Use AI for input-space coverage. Keep humans on state-space coverage and business-rule coverage. The combination is stronger than either alone — AI handles the volume of boundary and encoding cases that humans miss; humans handle the semantic cases that AI cannot reach.
