ChatGPT and Claude for Test Design

Coding assistants help you type tests faster. General-purpose chat assistants — ChatGPT, Claude, Gemini — help you do the parts of QA that aren't typing at all: designing test plans, generating realistic data, debugging cryptic errors, drafting bug reports, learning unfamiliar APIs. Most QA engineers underuse this category, and it's often where the biggest time savings live.

Test design — enumerating cases at speed

A senior tester can list 15 edge cases for a date picker given an hour. A chat assistant can do it in 30 seconds. Neither list is "the answer"; each is a starting point. But the assistant's enumeration is often broader than one person's brain dump, and reviewing 15 suggestions is much faster than producing 15 from scratch.

Useful prompts:

"What edge cases should I test for a date picker that accepts YYYY-MM-DD?"
"Generate test scenarios for a login form: 1 happy path + 10 negative scenarios."
"I need to test a search feature. What types of queries should I cover — empty, very long, special characters, injection attempts, Unicode, near-misses?"
"Help me design a test plan for a checkout flow. Group scenarios by risk level."

The output is rarely 100% relevant, but the roughly 60% that is can save hours.
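Once you have reviewed the suggestions, the ones you keep can go straight into a table of inputs and expected results. The sketch below assumes a hypothetical `is_valid_iso_date` function standing in for the date picker's own parsing logic, and the case list is an illustrative shortlist, not model output.

```python
import re
from datetime import date

# Hypothetical validator standing in for the date picker's parsing logic.
# ASCII digits only, so full-width or other Unicode digits are rejected.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_iso_date(text: str) -> bool:
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE.fullmatch(text):
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


# Reviewed shortlist: (input, expected validity).
EDGE_CASES = [
    ("2024-02-29", True),    # leap day in a leap year
    ("2023-02-29", False),   # leap day in a non-leap year
    ("1900-02-29", False),   # century year that is not a leap year
    ("2000-02-29", True),    # century year that is a leap year
    ("2024-04-31", False),   # April has 30 days
    ("2024-13-01", False),   # month out of range
    ("2024-00-10", False),   # month zero
    ("2024-1-5", False),     # missing zero padding
    ("29-02-2024", False),   # wrong field order
    ("2024/02/29", False),   # wrong separator
    ("", False),             # empty input
    (" 2024-02-29", False),  # leading whitespace
    ("２０２４-02-29", False),  # full-width digits
]


def run_edge_cases():
    """Return (input, expected, actual) for every case in the shortlist."""
    return [(text, expected, is_valid_iso_date(text)) for text, expected in EDGE_CASES]


for text, expected, actual in run_edge_cases():
    status = "ok" if expected == actual else "MISMATCH"
    print(f"{text!r:>16} expected={expected} actual={actual} {status}")
```

The same list drops into a parametrised test with little change. Keeping the reason next to each case makes it easier to prune the ones that do not apply to your product.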

Test data generation

Realistic test data is tedious to write by hand and easy to generate with a chat assistant.

Generate 20 test users in JSON format. Realistic UK names, varied emails
(some with +tag aliases), London-area addresses, mobile phone numbers in
07XXX XXXXXX format. Include a mix of ages 18-75.

Create CSV test data for an e-commerce site: 50 products. Columns: id,
name, category, price (£), stock_count, is_active. Categories: Electronics,
Clothing, Home, Books, Sports. Mix of in-stock and out-of-stock; mix of
active and inactive. Include some realistic edge cases (price = 0.01,
stock_count = 0, very long product names).

The output is good enough to seed a database, parametrise a test, or fill a fixture file — instantly. If you need true data variety, generate in batches and merge.
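Merging batches is also a good moment to check the data against what the prompt asked for. The sketch below assumes each batch is a JSON array of objects with `name`, `email`, `phone` and `age` keys; those field names and the sample records are illustrative assumptions, not a fixed format.

```python
import json
import re

# Assumed phone format from the prompt: 07XXX XXXXXX.
UK_MOBILE = re.compile(r"07[0-9]{3} [0-9]{6}")


def merge_batches(batches: list[str]) -> list[dict]:
    """Parse JSON batches, drop duplicate emails, keep only records that match the prompt."""
    seen_emails = set()
    merged = []
    for raw in batches:
        for user in json.loads(raw):
            email = str(user.get("email", "")).lower()
            age = user.get("age")
            if not email or email in seen_emails:
                continue  # missing or duplicate email
            if not UK_MOBILE.fullmatch(str(user.get("phone", ""))):
                continue  # phone number not in the requested format
            if not isinstance(age, int) or not 18 <= age <= 75:
                continue  # age missing or outside the requested range
            seen_emails.add(email)
            merged.append(user)
    return merged


# Two small illustrative batches, as if pasted from separate chat responses.
batch_a = json.dumps([
    {"name": "Amelia Clarke", "email": "amelia.clarke+shop@example.com",
     "phone": "07911 123456", "age": 34},
    {"name": "Tom Hughes", "email": "tom.hughes@example.com",
     "phone": "7911 123456", "age": 52},  # malformed phone number
])
batch_b = json.dumps([
    {"name": "Amelia Clarke", "email": "Amelia.Clarke+shop@example.com",
     "phone": "07911 123456", "age": 34},  # duplicate of a record in batch_a
    {"name": "Priya Shah", "email": "priya.shah@example.com",
     "phone": "07700 900123", "age": 75},
])

users = merge_batches([batch_a, batch_b])
print([user["name"] for user in users])
```

Silently dropping bad records keeps the sketch short. In practice you may prefer to log them, since a rejected record often shows where the prompt was ambiguous.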

Bug analysis

Paste a stack trace plus the relevant code and ask, "What could cause this?" The model filters out the noise, names a probable culprit, and suggests where to look next. It won't always be right, but it will narrow your search, and narrowing the search is usually the most expensive part of debugging.

A second use is converting a vague bug report ("the search is broken sometimes") into reproduction steps:

A user reported: "Search returns no results for valid product names sometimes."
Help me clarify this. List 8 possible interpretations of what they actually
experienced, with reproduction steps for each. Prioritise by likelihood.

The output is a triage cheat sheet you can work through with the reporter.

Documentation drafts

"Convert this test code into a Markdown test plan for product owners."
"Document this Cypress custom command in JSDoc format."
"Summarise the changes in this PR for the release notes — focus on user-visible behaviour."

First drafts in seconds. Edit to taste.

ChatGPT vs Claude vs Gemini

The honest answer: try them all on real tasks, pick by feel. Quality differences exist but are small and shift with each new model release. Practical considerations:

ChatGPT (OpenAI). Most popular, with a broad ecosystem of integrations; excellent for general queries. Custom GPTs let you save reusable prompts.
Claude (Anthropic). Strong long-context handling: paste a file of several thousand lines and it generally stays coherent. Often preferred for code reasoning and longer artefacts.
Gemini (Google). Deep Google ecosystem integration; strong multimodal (image understanding) features.
Mistral. European, open-weights options; useful where data residency or self-hosting matters.

Pricing is roughly $20/month/user across the board for the paid tier. Free tiers are usable for casual work but rate-limited and on smaller models.

Where each helps most across the QA workflow

Approximate speed-up from AI chat assistants on common QA tasks:

Test design / scenario enumeration: 10x faster
Test data generation: 20x faster
Bug triage / log analysis: 5x faster
Documentation drafts: 5x faster
Learning unfamiliar APIs: 3x faster

Numbers above are rough — they reflect what teams who've adopted chat assistants commonly self-report, not benchmark studies. Your mileage will vary by domain, but the rank order is fairly stable: test data and design see the largest speed-ups; learning and triage see solid but smaller ones.

A worked example — comprehensive scenario generation

I need to test a "Forgot password" flow. The user enters their email, gets
a reset link, clicks it, and sets a new password.
 
Generate:
1. Happy path scenario.
2. Negative scenarios (8-12 cases covering input validation, expired
   links, rate limiting, security edge cases).
3. Edge cases (international characters, very long passwords, password
   reuse against history, simultaneous reset requests).
4. Format each as a Gherkin scenario (Given / When / Then).

The output is a comprehensive test plan that would take a senior tester an hour to think through manually. Your job is now editing rather than authoring — much faster, and you catch the assistant's mistakes (irrelevant scenarios, missed domain rules) as you go.
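Part of that editing pass can be mechanical. As a minimal sketch, the hypothetical helper below flags any scenario in pasted Gherkin text that lacks a Given, When, or Then step. It only looks at the first word of each line, so it is a rough structural check and not a Gherkin parser. The sample text is illustrative.

```python
REQUIRED_STEPS = {"Given", "When", "Then"}


def incomplete_scenarios(gherkin_text: str) -> list[str]:
    """Return titles of scenarios that lack a Given, When, or Then step."""
    first_words_by_title: dict[str, set[str]] = {}
    current = None
    for line in gherkin_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Scenario:"):
            current = stripped[len("Scenario:"):].strip()
            first_words_by_title[current] = set()
        elif current is not None and stripped:
            # Record the first word of every line inside the scenario.
            first_words_by_title[current].add(stripped.split()[0])
    return [
        title
        for title, words in first_words_by_title.items()
        if not REQUIRED_STEPS <= words
    ]


SAMPLE = """\
Feature: Forgot password

  Scenario: Reset with a registered email
    Given a registered user with email "user@example.com"
    When they request a password reset
    Then a reset link is sent to that address

  Scenario: Expired reset link
    Given a reset link older than the allowed lifetime
    When the user opens the link
"""

print(incomplete_scenarios(SAMPLE))
```

Scenarios that share a title are merged by this sketch, which is acceptable for a quick lint. Whether a scenario applies to your product still needs a human reviewer.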

Limitations

Tests are generated, not verified. The model has no way to check whether a scenario applies to your actual product. You do.
Domain blind spots. It doesn't know your business domain. Always feed context: "This is a UK pension product subject to FCA rules."
Confident hallucinations. Made-up library APIs, plausible-but-wrong regex, fabricated regulatory references. Read carefully.
Stale knowledge. Models have a training cutoff. For new framework features, double-check against the official docs.

Building a personal prompt library

Once a prompt works well, save it. A simple Markdown file or a shared team repo of "prompts that work" pays back fast. Most experienced QA engineers settle on 10–20 reusable prompts they tweak each time rather than re-inventing.
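If the library lives in a repo, the saved prompts can be kept as templates with named placeholders, so the parts you tweak each time are explicit. This is a minimal sketch using Python's standard `string.Template`; the `PROMPTS` dictionary, the `render` helper and the prompt wording are illustrative assumptions.

```python
from string import Template

# Hypothetical prompt library: one named template per reusable prompt.
PROMPTS = {
    "scenarios": Template(
        "Context: $context\n"
        "Generate $count test scenarios for $feature. "
        "Group them by risk level and format each as a Gherkin scenario."
    ),
    "triage": Template(
        "Context: $context\n"
        'A user reported: "$report"\n'
        "List possible interpretations with reproduction steps, "
        "prioritised by likelihood."
    ),
}


def render(name: str, **fields: str) -> str:
    """Fill a saved prompt. Raises KeyError if a placeholder has no value."""
    return PROMPTS[name].substitute(**fields)


prompt = render(
    "scenarios",
    context="This is a UK pension product subject to FCA rules.",
    count="10",
    feature="the login form",
)
print(prompt)
```

Using `substitute` rather than `safe_substitute` means a forgotten placeholder fails loudly, which guards against sending a prompt with no context.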

⚠️ Common Mistakes

Treating chat output as a finished test plan. It's a draft. The 30 seconds it took to generate doesn't mean you can skip the 10 minutes to review.
No context. "Generate test cases for my login" gets generic output. "Generate test cases for a UK banking login subject to FCA strong customer authentication" gets useful output.
Single-prompt expectation. Real prompts iterate: ask, refine, narrow, expand. Three rounds of tightening usually beats one perfect prompt.

🎯 Practice Task

30 minutes.

Pick a feature in your product. Open ChatGPT, Claude, or Gemini.
Ask for 10 test scenarios. Read the output critically.
Refine the prompt with more context — domain rules, constraints, audience.
Repeat. Note how the output changes.
Save your final prompt as the first entry in a personal prompt library.

Next lesson: principles of prompt engineering for test automation specifically.