Prompt Engineering for Testing.

Write prompts that get reliable, reviewable results. A practical guide for QA engineers: how to structure AI instructions so the output is actually useful — and safe to use.

The AI Prompt Library gives you ready-to-use prompts — this page teaches you how they work and how to write your own. Good prompt engineering is the difference between AI output you can review and own, and output that looks plausible but misleads you.

What prompt engineering means for QA

Prompt engineering is the practice of crafting structured, clear instructions that guide an AI model toward reliable, useful output. It is not about tricking the model or unlocking hidden capabilities — it is about communicating precisely, the same way a good test case communicates precisely.

QA engineers already think this way. A test case is a structured instruction: preconditions, action, expected result. A prompt is the same idea applied to an AI: role, context, task, expected output format. The skills transfer directly.

The key mindset shift: AI output is always a draft. A large language model generates statistically plausible text. It has no knowledge of your system beyond what the prompt contains, and it can hallucinate confident-sounding claims. Your job is to specify precisely what you need, then review and own the result.

AI is an assistant, not an authority. Every prompt produces a draft that requires human review. Verify selectors, test logic, API paths, and factual claims before using any output. Never paste credentials, API keys, tokens, or customer data into AI tools.

Anatomy of a good testing prompt

A reliable testing prompt has six components. You do not always need all six, but each one you omit makes the output less predictable.

Component 1: Role

Set the model's frame of reference. "You are a senior QA engineer specialising in API testing" produces different output than an unframed prompt. The role constrains tone, vocabulary, and assumptions.

You are a senior QA engineer with experience in Playwright and REST API testing.

Component 2: Context

Describe the system, feature, and tech stack. What does the feature do? Which framework? Which language? What does the API return? Without context the model invents plausible-sounding but wrong details.

Feature: checkout flow on a React SPA. Backend: Node.js API, PostgreSQL. Auth: JWT. Test framework: Playwright TypeScript.

Component 3: Inputs

Provide the specific artifact — acceptance criteria, an OpenAPI spec fragment, a code snippet, a bug description. Paste only what is safe; never include credentials, tokens, or PII.

Acceptance criteria: "Users can filter products by price range using a dual-handle slider. Min: £0. Max: £500. Step: £1."

Component 4: Constraints

Define scope boundaries. Which language? Which framework? What to exclude? Constraints prevent the model from solving a different problem than the one you asked.

Use Playwright TypeScript only. Do not include component tests. Focus on E2E scenarios only.

Component 5: Output format

Specify the exact format — markdown table, numbered list, code block, JSON. Without this the model defaults to verbose prose that is hard to act on.

Output: markdown table with columns — Test ID | Scenario | Input | Expected result | Priority.

Component 6: Assumptions & review checklist

Ask the model to surface its uncertainty and help you verify output. These two meta-requests are the most consistently underused — and the most valuable.

List any assumptions you made. Add a 5-item checklist for a QA engineer to verify this output.
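One way to keep the six components visible is to hold them as named fields and join them in a fixed order, so an omitted piece is easy to spot. The sketch below is illustrative only: the `PromptParts` type and the `buildPrompt` and `missingParts` functions are hypothetical names, not part of any library, and the example values are the ones used in the component descriptions above.

```typescript
// Hypothetical helper for illustration; not part of any library.
type PromptParts = {
  role?: string;
  context?: string;
  inputs?: string;
  constraints?: string;
  outputFormat?: string;
  closing?: string; // the assumptions and review-checklist request
};

const PART_ORDER: (keyof PromptParts)[] = [
  "role", "context", "inputs", "constraints", "outputFormat", "closing",
];

// True when a component is absent or contains only whitespace.
function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim().length === 0;
}

// Join the components that are present, in a fixed order, separated by blank lines.
function buildPrompt(parts: PromptParts): string {
  return PART_ORDER
    .filter((name) => !isBlank(parts[name]))
    .map((name) => (parts[name] as string).trim())
    .join("\n\n");
}

// Name the components that were left out, so each omission is a deliberate choice.
function missingParts(parts: PromptParts): string[] {
  return PART_ORDER.filter((name) => isBlank(parts[name]));
}

const checkoutPrompt = buildPrompt({
  role: "You are a senior QA engineer with experience in Playwright and REST API testing.",
  context: "Feature: checkout flow on a React SPA. Backend: Node.js API, PostgreSQL. Auth: JWT. Test framework: Playwright TypeScript.",
  constraints: "Use Playwright TypeScript only. Do not include component tests. Focus on E2E scenarios only.",
  outputFormat: "Output: markdown table with columns: Test ID | Scenario | Input | Expected result | Priority.",
  closing: "List any assumptions you made. Add a 5-item checklist for a QA engineer to verify this output.",
});
```

Calling `missingParts` with the same object would report `inputs`, a reminder to paste the acceptance criteria before sending.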

Providing context effectively

Context is the biggest lever for improving output quality. More relevant context → more relevant output. But the context window has limits, and the sensitivity of what you paste is a security responsibility.

Useful context to include

Tech stack and testing framework (language, library, versions if relevant)
Feature description or acceptance criteria (what the feature does, not just its name)
API response shape (status codes, key response fields)
User roles and permission model, if relevant to the task
Existing test patterns in your repo (a short example helps the model match your style)
CI constraints (runner, parallelism, retry settings)

What to never include

Passwords, API keys, tokens, or any credentials
Customer data, PII, or personal information of any kind
Database connection strings or environment configuration
Anything your organisation classifies as confidential or restricted

AI tools may log, store, or train on prompt content. Treat prompts as you would a public message. See also: prompt injection — a risk when including untrusted user-generated content in prompts.
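A pre-send scan can catch some accidental leaks before a prompt leaves your machine. The following is a rough sketch under stated assumptions: the pattern list and the `findSensitive` function are hypothetical, the regular expressions are simple heuristics, and a clean result does not prove a prompt is safe. It complements human judgement and your organisation's data-classification rules; it does not replace them.

```typescript
// Heuristic patterns only (assumed for illustration); extend to match your own secrets.
type Finding = { label: string; match: string };

const SENSITIVE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "bearer token", pattern: /Bearer\s+[A-Za-z0-9\-._~+\/]{16,}=*/ },
  { label: "JWT-like string", pattern: /eyJ[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}/ },
  { label: "connection string", pattern: /\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?):\/\/\S+/i },
  { label: "email address", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  {
    label: "key or password assignment",
    // Braces are excluded so placeholders such as {API_KEY} are not flagged.
    pattern: /\b(?:api[_-]?key|secret|password|token)\s*[:=]\s*["']?[^\s"'{}]{6,}/i,
  },
];

// Return the first match for each pattern that fires on the prompt text.
function findSensitive(promptText: string): Finding[] {
  const findings: Finding[] = [];
  for (const { label, pattern } of SENSITIVE_PATTERNS) {
    const m = pattern.exec(promptText);
    if (m !== null) findings.push({ label, match: m[0] });
  }
  return findings;
}
```

If `findSensitive` returns anything, stop and replace the matched text with a placeholder before sending.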

Defining output format

The single easiest improvement to any prompt is specifying the output format. Without it the model defaults to a verbose essay. With it, you get something you can directly act on.

Test cases

Output: markdown table with columns: Test ID | Scenario | Input | Expected result | Priority.

Edge cases

Output: numbered list. One edge case per line. No explanations.

Playwright test code

Output: TypeScript using test.describe / test blocks. Use Playwright's built-in request fixture. Descriptive test names.

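Bug report

Output: structured report with sections: Summary, Severity and priority, Environment, Steps to reproduce, Expected result, Actual result.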

API test checklist

Output: checklist in markdown. Group by: happy path, negative cases, boundary values, auth.

Be precise. "Return as a table" is better than nothing. "Return as a markdown table with these columns: Test ID | Scenario | Input | Expected result" is better still.
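A precise format has a second benefit: you can check mechanically whether the model followed it. The sketch below looks for the first markdown table in a response and reports which requested columns are missing from its header. The function names `splitRow` and `missingColumns` are hypothetical, and the table detection is a simple heuristic, not a full markdown parser.

```typescript
// Split one markdown table row into trimmed cell values.
function splitRow(row: string): string[] {
  return row
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((cell) => cell.trim());
}

// Return the expected columns that are absent from the first table header found.
function missingColumns(output: string, expected: string[]): string[] {
  const lines = output.split(/\r?\n/);
  // A header row is a line containing "|" followed directly by a separator row such as | --- | --- |.
  const separator = /^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$/;
  for (let i = 0; i + 1 < lines.length; i++) {
    if (lines[i].indexOf("|") !== -1 && separator.test(lines[i + 1])) {
      const header = splitRow(lines[i]).map((cell) => cell.toLowerCase());
      return expected.filter((col) => header.indexOf(col.toLowerCase()) === -1);
    }
  }
  return expected.slice(); // no table found, so every column is missing
}
```

A non-empty result is a cue to follow up in the same conversation and ask for the missing columns, not to fill them in by hand.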

Requesting edge cases & negative scenarios

Ask for "test cases" without qualification and you will get mostly happy-path scenarios, likely because the text models are trained on skews toward successful outcomes. You have to request negative testing and edge cases explicitly; they do not appear by default.

Categories to request explicitly

Boundary values — exactly at limits, one over, one under
Invalid inputs — wrong type, wrong format, malformed data
Empty and null states — empty strings, null, undefined, empty arrays
Auth and permission failures — missing token, expired token, wrong role
Concurrent access — simultaneous requests from the same or different accounts
Network and service failures — timeouts, 500 responses from upstream dependencies
State-dependent scenarios — actions on already-deleted, locked, or archived resources

Add to any test case prompt

Also include negative scenarios, boundary value tests, and at least two auth/permission failure cases. Flag any gaps in the spec that prevent you from writing complete edge cases.
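Boundary values are the one category above that you can derive mechanically from a spec, which makes them a quick cross-check on what the model returns. A minimal sketch, assuming an inclusive numeric range with a fixed step; `boundaryValues` is a hypothetical helper name.

```typescript
// Values just outside, exactly at, and just inside each end of an inclusive range.
function boundaryValues(min: number, max: number, step: number = 1): number[] {
  if (step <= 0 || min > max) {
    throw new RangeError("step must be positive and min must not exceed max");
  }
  const candidates = [min - step, min, min + step, max - step, max, max + step];
  // Drop duplicates (they occur for very narrow ranges) and sort ascending.
  return candidates
    .filter((value, index) => candidates.indexOf(value) === index)
    .sort((a, b) => a - b);
}

// Price slider from the acceptance criteria earlier in this guide: min 0, max 500, step 1.
const sliderBoundaries = boundaryValues(0, 500, 1); // [-1, 0, 1, 499, 500, 501]
```

If the model's table is missing any of these values, ask for them by name in a follow-up.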

Asking for assumptions & a review checklist

These two additions to the end of any prompt are the most underused — and among the most valuable.

Surfacing assumptions

When context is incomplete — which it always is — the model fills gaps with assumptions. Without asking, those assumptions are invisible, embedded in the output as if they were confirmed requirements. Adding List any assumptions you made forces them into view so you can validate or correct them.

Also useful: Flag any ambiguities in the spec. AI is good at spotting contradictions and under-specified cases in acceptance criteria — something you can use in refinement sessions.

Requesting a review checklist

Ask the model to help you verify its own output: Add a 5-item checklist for a QA engineer to verify this output is correct. This gives you a quick validation guide, and writing the checklist can prompt the model to re-examine its answer, which may improve output quality.

Standard closing to add to any prompt

List any assumptions you made.
Flag any ambiguities in the spec.
Add a 5-item review checklist for a QA engineer to verify this output.

Iterating on prompts

The first output is a starting point. Plan to iterate — either by refining the prompt itself or by following up in the same conversation.

In-conversation refinement

"The test cases are too high-level. Add specific input values and the exact API response for each."
"Focus only on the authentication flow — remove the registration scenarios."
"Convert the table to Playwright test.describe blocks using the test IDs as test names."
"Re-check the boundary values for quantity — you used 100 but the spec says max is 99."

Hallucination risk in long sessions

As a conversation grows, the model's reliability can drift. It may lose track of earlier context, contradict prior output, or fill gaps with plausible-sounding fabrications. Re-anchor context periodically: summarise what has been decided, paste the spec fragment again, or start a fresh conversation with a refined prompt based on what you learned. See also: evaluation datasets — useful for testing whether a prompt produces consistent output across runs.
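Consistency across runs can be given a rough number. The sketch below compares the scenario lists from two runs of the same prompt using Jaccard similarity (shared items divided by all distinct items). It is a crude measure under a strong assumption: scenarios are matched by normalised text, so a paraphrase counts as a different scenario. The function names are hypothetical.

```typescript
// Normalise a scenario line so differences in case and spacing do not count.
function normaliseScenario(line: string): string {
  return line.toLowerCase().replace(/\s+/g, " ").trim();
}

// Distinct, non-empty, normalised scenarios from one run.
function distinctScenarios(run: string[]): string[] {
  const cleaned = run.map(normaliseScenario).filter((s) => s.length > 0);
  return cleaned.filter((s, i) => cleaned.indexOf(s) === i);
}

// Jaccard similarity: 1 means identical scenario sets, 0 means no overlap.
function runSimilarity(runA: string[], runB: string[]): number {
  const a = distinctScenarios(runA);
  const b = distinctScenarios(runB);
  const shared = a.filter((s) => b.indexOf(s) !== -1).length;
  const total = a.length + b.length - shared;
  return total === 0 ? 1 : shared / total;
}
```

A low score across several fresh conversations suggests the prompt leaves too much open; tightening the context or the requested categories is usually the next thing to try.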

Saving good prompts

When you find a prompt pattern that works reliably, save it. A team prompt library reduces time spent re-deriving the same framing. The AI Prompt Library gives you a starting set of ready-made patterns across 10 testing categories — including the system prompt patterns used in repo assistant instructions.

Common prompting mistakes

1. Too vague

"Write tests for the login page" gives the model nothing: no framework, no feature details, no output format. The result is generic and mostly unusable. Always include context and format.

2. No output format

Without a specified format, models produce verbose prose. You then have to reformat manually — the most time-consuming part. Specify the exact table columns, code style, or list structure upfront.

3. Asking too many things at once

Bundling "write test cases, generate Playwright code, identify gaps, suggest automation priorities" into one prompt produces unfocused output for each part. Break complex tasks into sequential prompts.

4. Pasting credentials or customer data

This is a security and compliance violation. Never paste passwords, API keys, tokens, database connection strings, or any personal data into an AI tool. Use placeholder values in prompts — {API_KEY}, {USER_EMAIL} — and fill them in locally.
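Filling placeholders locally means the substitution happens on your machine, on the output the AI returned, never on the prompt you send. A minimal sketch, assuming placeholders are written as upper-case names in braces like the examples above; `fillPlaceholders` and `unresolvedPlaceholders` are hypothetical helper names.

```typescript
// Replace {NAME} placeholders with locally supplied values; unknown names are left in place.
// Caveat: this simple pattern would also match the braces inside a ${NAME} template expression.
function fillPlaceholders(text: string, values: Record<string, string>): string {
  return text.replace(/\{([A-Z][A-Z0-9_]*)\}/g, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : whole
  );
}

// List the distinct placeholders still present, so nothing ships half-filled.
function unresolvedPlaceholders(text: string): string[] {
  const found = text.match(/\{[A-Z][A-Z0-9_]*\}/g);
  return found === null ? [] : found.filter((p, i) => found.indexOf(p) === i);
}
```

In practice the values would come from your local environment or secret store at test run time, so the real key never appears in a prompt or in the generated file you commit.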

5. Accepting output without review

AI can hallucinate selector values, invent API paths that don't exist, misread boundary conditions, and miss entire test categories. Every output is a draft. Verify selectors, check paths against the real API, and cross-reference logic against the spec before using anything.

6. Rebuilding context every time

Re-deriving the same context in every prompt is wasteful. Create a reusable "context block" (tech stack, framework, existing patterns) that you paste at the top of each new session, or keep related prompts in one conversation thread.

7. Asking for "thorough" coverage instead of specifying it

"Be thorough" gives a model nothing concrete to act on. "Cover boundary values, auth failures, empty states, and concurrent access" is specific and actionable. Explicit categories produce explicit coverage.

Prompt quality checklist

Before sending a prompt, run through this list. Each unchecked item is a likely source of poor output.

Role is defined ("You are a senior QA engineer…")
Context is provided: feature behaviour, tech stack, testing framework
No credentials, PII, or sensitive data included
Output format is explicitly specified (table columns, code style, list structure)
Edge cases and negative scenarios are explicitly requested
Boundary values are explicitly requested
Auth and permission failure scenarios requested if relevant
Scope is constrained — not asking for too many things in one prompt
Assumptions are requested ("List any assumptions you made")
Review checklist is requested ("Add a 5-item checklist to verify this output")
Output will be reviewed by a human before use in production
Sensitive information — credentials, customer data — kept out of the prompt
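A few items on this checklist can be approximated with keyword checks, which is handy as a reminder when prompts are stored in a repo. The sketch below is deliberately shallow: the `PROMPT_CHECKS` list and `uncheckedItems` function are hypothetical, the patterns are assumptions about typical phrasing, and a pass says nothing about whether the context is accurate or the scope sensible.

```typescript
// Keyword heuristics for a subset of the checklist; phrasing patterns are assumptions.
type PromptCheck = { item: string; pattern: RegExp };

const PROMPT_CHECKS: PromptCheck[] = [
  { item: "Role is defined", pattern: /\byou are an?\s/i },
  { item: "Output format is specified", pattern: /\boutput\s*:/i },
  { item: "Negative or edge cases requested", pattern: /\b(negative|edge case|boundary)/i },
  { item: "Assumptions requested", pattern: /\bassumptions?\b/i },
  { item: "Review checklist requested", pattern: /\bchecklist\b/i },
];

// Return the checklist items whose pattern does not appear in the prompt.
function uncheckedItems(promptText: string): string[] {
  return PROMPT_CHECKS
    .filter((check) => !check.pattern.test(promptText))
    .map((check) => check.item);
}
```

Run against "Write tests for the login page." this reports all five items, which matches the first mistake described above.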

Worked examples: poor vs improved

Each pair shows the same task framed poorly then improved. The note explains why the improved prompt produces more useful output.

Feature spec → test cases

Poor prompt

Write tests for the login page.

Improved prompt

You are a senior QA engineer.

Feature: Email + password login at /login on a React SPA. The API returns a JWT on success (200) and 401 on invalid credentials. There is a "Remember me" checkbox that persists the session for 30 days.

Task: Generate a test case list covering:
- Happy path (valid credentials → redirect to /dashboard)
- Invalid credentials (wrong password, unknown email, correct email wrong case)
- Empty fields (each field blank; both fields blank)
- Remember me behaviour (session persists / expires without it)
- Account lockout after 5 consecutive failures

Output: markdown table — Test ID | Scenario | Input | Expected result | Priority (P1/P2/P3).
List any assumptions you made about system behaviour.

The improved prompt names the tech stack, the API contract, and a specific behaviour (remember me), and requests a structured table with defined columns and priorities. The poor prompt gives the model nothing to work with.

Acceptance criteria → edge cases

Poor prompt

What edge cases should I test?

AC: Users can upload a profile photo.

Improved prompt

You are a QA engineer specialising in file upload testing.

Acceptance criteria: "Users can upload a profile photo. Supported formats: JPEG, PNG, WebP. Maximum size: 5 MB. Image is resized to 200×200 px and stored in S3."

Generate edge cases covering:
- File format boundaries (valid: JPEG/PNG/WebP; invalid: GIF, PDF, SVG, HEIC, zero-byte file)
- Size boundaries (exactly 5 MB, 5 MB + 1 byte, 0 bytes)
- Filename edge cases (spaces, unicode characters, very long filename, no extension)
- Concurrent uploads from the same account
- Upload interruption mid-transfer

Output: numbered list. Flag any AC ambiguities as assumptions.
Add a 5-item review checklist for a QA engineer to verify edge case coverage.

The poor prompt gives a single AC line with no format and no instruction to go beyond the happy path. The improved version scopes to upload testing, requests specific edge case categories, and asks for a review checklist.

Bug notes → structured bug report

Poor prompt

Write a bug report: search doesn't work when typing fast.

Improved prompt

You are a QA engineer writing a structured bug report.

Observation: When a user types more than ~4 characters per second in the search input, the results list shows results from a previous query instead of the current one, or shows no results when results are expected. Typing slowly does not reproduce the issue. Affects Chrome 124 and Safari 17 on macOS 14.

App context: React SPA, search input debounced at 300 ms, results fetched from GET /api/search?q={query}.

Write a bug report with:
- Summary (one sentence)
- Severity and priority (justify both briefly)
- Environment (browser, OS, app version placeholder)
- Steps to reproduce (numbered, with specific keystroke timing)
- Expected result
- Actual result
- Root cause hypothesis (debounce race condition?)
- Areas for a developer to investigate

Mark hypotheses clearly — do not present them as confirmed facts.

The poor prompt gives vague symptoms. The improved version provides reproduction conditions, a plausible hypothesis, and explicitly asks the model to separate facts from hypotheses — reducing the risk of a bug report that looks authoritative but contains guesses.

OpenAPI spec → API test suite

Poor prompt

Write API tests for my endpoint.

POST /api/orders

Improved prompt

You are a senior QA engineer writing API tests in Playwright TypeScript using the built-in APIRequestContext (request fixture).

Endpoint spec:
POST /api/orders
Auth: Bearer token in Authorization header (required)
Request body (JSON): { "productId": string (required), "quantity": integer 1–99 (required), "couponCode": string (optional) }
Responses: 201 { "orderId": string, "total": number } | 400 validation error | 401 missing/invalid token | 422 insufficient stock

Generate tests covering:
- Happy path (valid body + valid token → 201, response shape matches spec)
- Missing required fields (productId missing, quantity missing, both missing)
- Boundary values on quantity (0, 1, 99, 100)
- Invalid coupon code
- Missing Authorization header → 401
- Malformed token → 401

Output: TypeScript using test.describe / test blocks with descriptive names. Add a comment above each describe block explaining the contract property it verifies. Use only Playwright's built-in request — no external HTTP libraries.

List assumptions about test setup (base URL env var, token provisioning).

The poor prompt gives a half-line spec and no framework. The improved prompt pins the tool, provides the full response contract, and requests boundary and auth tests explicitly — categories AI skips by default.
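When reviewing what the model generates for the happy path, "response shape matches spec" is worth pinning down as an explicit check, since a bare status assertion is easy to mistake for one. A minimal sketch of such a check for the 201 body described in the prompt above; the `OrderCreated` type and `isOrderCreated` guard are hypothetical names, and the field list is taken only from that example spec.

```typescript
// Shape of the 201 response body as described in the example endpoint spec.
type OrderCreated = { orderId: string; total: number };

// Type guard: true only when the parsed JSON body has the fields and types the spec lists.
function isOrderCreated(body: unknown): body is OrderCreated {
  if (typeof body !== "object" || body === null) return false;
  const record = body as Record<string, unknown>;
  const orderId = record["orderId"];
  const total = record["total"];
  return typeof orderId === "string" && typeof total === "number" && isFinite(total);
}
```

Inside a Playwright test this could be applied to the parsed response body, for example by asserting that `isOrderCreated(await response.json())` is true after checking the status code.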

Framework design ask

Poor prompt

Help me design a test framework.

Improved prompt

You are a senior test automation architect.

Context:
- App: React/TypeScript SPA + Node.js REST API + PostgreSQL
- Team: 3 QA engineers, 8 developers, CI on GitHub Actions
- Current state: ~50 Playwright E2E tests, no API test layer, no component tests
- Key pain: E2E suite averages 4 min and has 10–15% flake rate from waiting on full UI render
- Goal: reduce execution time by 50% and add an API test layer within 3 months

Design a 3-layer test pyramid (component, API, E2E). For each layer specify:
- Recommended tool (justify in one sentence)
- What to test at this layer and NOT at others
- Target test-count ratio relative to the current 50 E2E tests
- CI trigger strategy (on every PR / on merge / nightly)

Output: structured markdown with a heading per layer, then a short "migration sequence" section.
Flag any architectural assumptions you make about the team's TypeScript fluency or CI resources.

The poor prompt is open-ended with no constraints. The improved version gives team composition, current pain, goal, and requests structured output with justification and flagged assumptions — essential when AI is recommending a multi-month architecture.

Next steps

Apply the skill

The AI Prompt Library contains 46 ready-made prompts across 10 testing categories — test cases, automation scripts, bug reports, API tests, accessibility, security, and more. Each prompt is built on the structure this guide describes.

Glossary terms used in this guide

Prompt Engineering, LLM, Hallucination, Context Window, System Prompt, Prompt Injection, Negative Testing, Validation, Evaluation Dataset