Agile Testing

How testing actually works inside an Agile team — what QA does in each ceremony, how to size effort per story, what "done" means, and how the practice extends from CI into production.

Agile Testing Principles

| Principle | What it looks like in practice |
| --- | --- |
| Testing is continuous, not a phase | Tests run every commit; QA pairs with devs all sprint, not at the end |
| Quality is the whole team's responsibility | Devs write unit + integration tests; product owns acceptance criteria; QA orchestrates and explores |
| Fast feedback over comprehensive documentation | A bug raised in standup beats a 4-page report a week later |
| Working software over extensive test plans | Run the feature, even half-built, instead of waiting for a "complete" spec to plan against |
| Respond to change over following the plan | Re-prioritise tests when scope shifts mid-sprint; the plan serves the work, not the other way around |
| Prevention over detection | Catch issues at the requirements / design stage: cheaper than catching them in QA, much cheaper than in production |
| Shift-left: QA involved early | Review acceptance criteria, attend design reviews, review PRs, write tests before code |

Agile Testing Quadrants (Brian Marick)

A model for thinking about what kind of testing you're doing and why. Two axes: business vs technology, and supporting the team vs critiquing the product.

                              Business-facing
                 ┌──────────────────────┬──────────────────────┐
                 │                      │                      │
                 │         Q2           │         Q3           │
                 │    Functional        │   Exploratory        │
                 │    User stories      │   Usability / UAT    │
                 │    Prototypes        │   Alpha / beta       │
  Supporting     │                      │                      │   Critiquing
  the team       ├──────────────────────┼──────────────────────┤   the product
                 │                      │                      │
                 │         Q1           │         Q4           │
                 │    Unit tests        │   Performance        │
                 │    Component tests   │   Security / load    │
                 │    Contract tests    │   Soak / chaos       │
                 │                      │                      │
                 └──────────────────────┴──────────────────────┘
                             Technology-facing

| Quadrant | Test types | Owner | Automation |
| --- | --- | --- | --- |
| Q1: technology-facing, supporting | Unit, component, contract | Devs | Fully automated |
| Q2: business-facing, supporting | Functional / story tests, prototypes, examples | Devs + QA | Automated where stable, manual for new behaviour |
| Q3: business-facing, critiquing | Exploratory, usability, UAT, alpha/beta | QA + real users | Manual, judgment-driven |
| Q4: technology-facing, critiquing | Performance, security, load, soak, chaos | Specialists + QA | Tool-driven, scheduled |

A balanced team invests in all four. A common smell: heavy in Q1 + Q2, no Q3 (no exploratory) and no Q4 (no perf/security). Bugs sneak through the gap.

Test Pyramid

The cost-and-coverage shape of an Agile test suite. Most tests at the bottom (cheap, fast); fewest at the top (slow, expensive).

            ╱╲
           ╱  ╲       ~10%   E2E       slowest, most fragile
          ╱────╲
         ╱      ╲     ~20%   Integration
        ╱────────╲
       ╱          ╲   ~70%   Unit       fastest, cheapest
      ╱────────────╲

| Layer | Typical share | Speed | Owner | Strengths |
| --- | --- | --- | --- | --- |
| Unit | ~70% | Milliseconds; runs on save | Devs | Logic errors in pure functions, edge cases, regressions in calculations |
| Integration | ~20% | Seconds | Devs + QA | Component interactions, DB queries, API contracts, message handling |
| End-to-end | ~10% | Tens of seconds | QA | Real user flows, deploy correctness, browser-specific behaviour |

Anti-patterns

| Anti-pattern | Shape | Why it fails |
| --- | --- | --- |
| Ice-cream cone | Tip-heavy E2E layer over a thin base | Slow CI, brittle tests, expensive maintenance, flaky signal |
| Hourglass | Many unit + many E2E, almost no integration | Big behavioural gaps: pure-logic units pass, full flows pass, but the seams between modules silently break |
| Cupcake | Decorations on top: manual tests stacked above E2E | Manual regression on every release; release cadence drops below business needs |

The pyramid isn't a law — for some products (libraries, pure-logic services) the right shape is even more bottom-heavy. For others (UI-heavy apps), 60/25/15 is more realistic. The point: be deliberate about the ratio, not accidental.
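
One way to keep the ratio deliberate is to compute it from the suite itself. The sketch below is illustrative only: `SuiteCounts`, `classify_shape`, and every cut-off value are assumptions made for this example, not an established standard, and a team would tune them to its own product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteCounts:
    """Number of automated tests at each layer (hypothetical structure)."""
    unit: int
    integration: int
    e2e: int

    def shares(self) -> tuple[float, float, float]:
        """Return (unit, integration, e2e) as fractions of the whole suite."""
        total = self.unit + self.integration + self.e2e
        if total == 0:
            raise ValueError("suite has no tests")
        return (self.unit / total, self.integration / total, self.e2e / total)


def classify_shape(counts: SuiteCounts) -> str:
    """Label the suite shape using illustrative, non-canonical cut-offs."""
    unit, integration, e2e = counts.shares()
    if e2e > unit:
        return "ice-cream cone"  # more E2E tests than unit tests
    if integration < 0.10 and e2e > integration:
        return "hourglass"       # thin middle between unit and E2E
    if unit >= integration >= e2e:
        return "pyramid"
    return "irregular"
```

Running this against real counts each sprint turns the shape into a trend the team can discuss, rather than something discovered when CI becomes slow.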

QA in Scrum Ceremonies

| Ceremony | What QA brings |
| --- | --- |
| Backlog refinement | Review upcoming stories for testability: can we tell when this is done? Flag missing or vague acceptance criteria. Raise risks (data, performance, accessibility) before sizing |
| Sprint planning | Estimate testing effort per story; identify test approach (manual / automated / both); raise dependencies (test data, third-party stubs, env access); split stories that are too big to test in-sprint |
| Daily standup | Testing status per story; blockers (broken build, env down, awaiting fix); fresh defects worth flagging early |
| Sprint review / demo | Demo tested features; show quality metrics (coverage, defect counts, escaped bugs); gather stakeholder feedback that becomes next sprint's input |
| Sprint retrospective | Process improvements: too much regression, slow CI, flaky environment, test-data setup pain, automation gaps. The retro is where QA practice gets better, so don't sit silent |

Three Amigos meeting

When a story is unclear, get a developer, a tester, and a product person together: the three amigos. The tester's role is to keep asking "what could go wrong?" and "what are the acceptance criteria for that case?" until the story is concrete enough to estimate.

Story Testing Workflow

The same-sprint flow that healthy Agile teams use. The order matters: testing tasks are spread across the sprint, not stacked at the end.

Story enters sprint
       │
       ▼
QA reviews AC ──── gaps? ──→ raise in standup / Three Amigos
       │
       ▼
QA writes scenarios (shift-left, before dev finishes)
       │
       ▼
Developer builds the feature
       │
       ▼
QA tests on dev branch or feature environment
       │
       ├──── bug found? ──→ communicate immediately (chat / pair > ticket)
       │                    └─ developer fixes ─ QA verifies
       ▼
Regression check (automated suite + targeted manual)
       │
       ▼
Story → Done (DoD met) → demo at review

What gets in the way

  • Story arrives in code review with no test scenarios. QA wasn't pulled in early — fix at refinement, not at the PR.
  • All testing happens on the last day of the sprint. Story was too big to ship + test in one sprint. Split it.
  • "It works on my machine." No shared dev/feature env, or env is broken. Treat env health as a blocker, not a fact of life.
  • Bugs filed but never fixed in-sprint. Carryover compounds. Cap WIP on bugs the same way you cap stories.

Definition of Done (DoD) — Testing Criteria

A story isn't done until everything below is true. Treat this as a checklist on the story card — paste it into the description if your tracker doesn't surface it natively.

□ All acceptance criteria verified (manual or automated)
□ Unit test coverage meets team threshold (e.g. ≥ 80 %)
□ Integration tests passing
□ Regression suite passing
□ No open critical or high severity defects
□ Performance benchmarks met (if perf-sensitive)
□ Accessibility checks passed (WCAG AA)
□ Cross-browser / cross-device tested per support matrix
□ Code reviewed and approved
□ Documentation updated (user-facing, API, runbook)
□ Telemetry / logging in place

Some teams also add: feature flag added (if behind one), translations updated, analytics event wired, security review checked off.

The exact list depends on the team — but every team should have an explicit DoD. "We'll know it when we see it" is how regressions ship.

Acceptance Criteria & BDD

INVEST — what makes a good user story

| Letter | Means | Tester's lens |
| --- | --- | --- |
| Independent | Can be developed without depending on another story | Can it be tested in isolation? |
| Negotiable | Detail can shift during refinement | Are the AC firm enough to derive cases, or still TBD? |
| Valuable | Delivers value to a user or stakeholder | Can you state the business outcome it enables? |
| Estimable | Team can size the effort | Is testing effort included in the estimate? |
| Small | Fits in one sprint | Can I test all the AC inside the sprint? |
| Testable | Acceptance criteria are verifiable | Can I write a pass/fail test for each AC? |

If you can't answer the testability question, the story isn't ready. Send it back to refinement.

Given / When / Then format

The standard structure for acceptance criteria in Agile + BDD teams. Each scenario reads as one observable outcome.

| Clause | Purpose |
| --- | --- |
| Given | Pre-existing state: the world as it is before the action |
| When | The action: exactly one event that triggers the behaviour |
| Then | The expected outcome: what must be true after the action |
| And / But | Extends the preceding Given, When, or Then with an additional clause |

Worked example

Given I am a logged-in user
And my cart is empty
When I add an item to my cart
Then the cart count should increase by 1
And I should see the item in the cart summary

Read top to bottom: the scenario is concrete, observable, and binary. The Then clauses are what the test will assert.

Multiple scenarios per story

Most stories need 3–6 scenarios — at minimum, one happy path plus the obvious failure modes.

Scenario: Add an item to an empty cart
  Given I am a logged-in user
  And my cart is empty
  When I add "Mountain Bike" to my cart
  Then the cart count should be 1
  And the cart summary should list "Mountain Bike"

Scenario: Add an out-of-stock item
  Given I am a logged-in user
  And my cart is empty
  When I attempt to add an out-of-stock item to my cart
  Then I should see an "Out of stock" message
  And the cart should remain empty

Scenario: Add an item while logged out
  Given I am not logged in
  When I attempt to add an item to my cart
  Then I should be redirected to the login page
  And the item should be added to the cart after I log in

Converting acceptance criteria to test cases

Each scenario in Given/When/Then form maps directly to a test case. The scope of the scenario determines the level at which the test runs:

| AC scenario level | Where the test runs |
| --- | --- |
| Pure logic / domain rule | Unit test |
| Service interaction | Integration test |
| End-to-end user flow | E2E (Cypress / Playwright / Selenium) |

The same Given/When/Then text can drive a manual test, a Cucumber/SpecFlow scenario, or be paraphrased into a Playwright test() block — pick the level that matches the AC's scope, not always the highest.
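
As an illustration of picking the lowest level that fits, two of the cart scenarios can be paraphrased into unit tests. `Cart` and `OutOfStockError` are hypothetical stand-ins invented for this sketch, not part of any real codebase; the point is only how each Given/When/Then clause lands in the test body.

```python
import unittest


class OutOfStockError(Exception):
    """Raised when an item cannot be added because none are available."""


class Cart:
    """Hypothetical in-memory cart, just enough to exercise the scenarios."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = stock
        self.items: list[str] = []

    @property
    def count(self) -> int:
        return len(self.items)

    def add(self, name: str) -> None:
        if self._stock.get(name, 0) <= 0:
            raise OutOfStockError(name)
        self.items.append(name)


class AddToCartTests(unittest.TestCase):
    def test_add_item_to_empty_cart(self) -> None:
        # Given my cart is empty
        cart = Cart(stock={"Mountain Bike": 3})
        # When I add "Mountain Bike" to my cart
        cart.add("Mountain Bike")
        # Then the cart count should be 1
        self.assertEqual(cart.count, 1)
        # And the cart summary should list "Mountain Bike"
        self.assertIn("Mountain Bike", cart.items)

    def test_add_out_of_stock_item(self) -> None:
        # Given my cart is empty and the item has no stock
        cart = Cart(stock={"Mountain Bike": 0})
        # When I attempt to add the out-of-stock item
        with self.assertRaises(OutOfStockError):
            cart.add("Mountain Bike")
        # Then the cart should remain empty
        self.assertEqual(cart.count, 0)
```

The logged-out scenario involves a redirect and a login round trip, so it would more naturally sit at the E2E level than in a unit test like these.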

Automation of acceptance tests

When AC are written in Gherkin, automation is mostly glue:

| Tool | Language | Best fit |
| --- | --- | --- |
| Cucumber | Java, JS/TS, Ruby, Python, others | Most ecosystems |
| SpecFlow | C# / .NET | Visual Studio |
| Behave | Python | Python projects (standalone runner, separate from pytest) |
| Robot Framework | Python (keyword-driven, BDD-like) | Acceptance + RPA |
| Karate | Java (Gherkin for API testing) | API-first BDD |

The benefit isn't speed of writing — it's that the AC become the test artefact. Product, dev, and QA all see the same Given/When/Then; nobody hand-translates between a Word doc and a code file.

The cost: discipline. If step definitions become a tangled mess of generic When I click {string} steps, you've lost the readability advantage. Keep step phrasing domain-specific, not technology-specific.
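
To show what the "glue" amounts to, here is a toy step registry written with only the standard library. It is not the API of Cucumber, Behave, or any other tool listed above; `step`, `run_scenario`, and the step functions are hypothetical names for this sketch. Note that the step phrases talk about carts and items, not clicks and selectors.

```python
import re

# Registered (compiled pattern, function) pairs, in definition order.
STEPS = []

KEYWORD = re.compile(r"^\s*(Given|When|Then|And|But)\s+")


def step(pattern: str):
    """Decorator that binds a domain phrase to a Python function."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


def run_scenario(context: dict, lines: list[str]) -> None:
    """Strip the Gherkin keyword from each line and call the matching step."""
    for line in lines:
        text = KEYWORD.sub("", line)
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                func(context, *match.groups())
                break
        else:
            raise LookupError(f"no step definition matches: {text!r}")


@step(r"my cart is empty")
def cart_is_empty(context):
    context["cart"] = []


@step(r'I add "(.+)" to my cart')
def add_item(context, name):
    context["cart"].append(name)


@step(r"the cart count should be (\d+)")
def cart_count_is(context, expected):
    assert len(context["cart"]) == int(expected)
```

A generic `I click {string}` step would match almost anything and describe almost nothing, which is where the readability advantage gets lost.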

Continuous Testing

The pipeline-driven extension of Agile testing — every commit verified through a layered test suite that gets slower as confidence grows.

Every commit triggers automated tests

The pipeline runs the same tests for every PR and every merge. Local "but it works on my machine" loses to CI as the source of truth.

The standard pipeline shape

commit
   │
   ▼
┌──────────┐  fail-fast — runs in seconds
│   lint   │
└────┬─────┘
     ▼
┌──────────┐  fast — isolated, no I/O
│   unit   │
└────┬─────┘
     ▼
┌──────────────┐  medium — DB, message bus, HTTP mocks
│ integration  │
└────┬─────────┘
     ▼
┌──────────┐    slow — full browser, real services
│   E2E    │
└────┬─────┘
     ▼
┌──────────────┐  optional / scheduled — load, soak
│ performance  │
└──────────────┘

Each stage gates the next. A failure in unit aborts the run before E2E even starts, so the cheapest checks give the fastest feedback.
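
The gating rule can be sketched in a few lines. This is a minimal model, not how any particular CI system is configured: `Stage` and `run_pipeline` are hypothetical names, and each stage is reduced to a callable that returns True on success.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def run_pipeline(stages: list[Stage]) -> list[tuple[str, str]]:
    """Run stages in order; once one fails, mark the rest as skipped."""
    results: list[tuple[str, str]] = []
    failed = False
    for stage in stages:
        if failed:
            # A later stage never executes after an earlier failure.
            results.append((stage.name, "skipped"))
            continue
        passed = stage.run()
        results.append((stage.name, "passed" if passed else "failed"))
        failed = not passed
    return results
```

Ordering the list from cheapest to slowest is what makes the gate worthwhile: a lint or unit failure costs seconds instead of a full E2E run.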

Fast-feedback budget

| Stage | Target time | What this means in practice |
| --- | --- | --- |
| Lint | < 30 s | Pre-commit hook catches it before CI fires at all |
| Unit | < 2 min | Devs trust the green light enough to keep flowing |
| Integration | < 5 min | Acceptable to wait on a PR |
| E2E | < 15 min total | Sharded across runners; each shard < 5 min |
| Performance | Scheduled / nightly | Not blocking PRs, but visible to the team |

If the unit stage takes 20 minutes, devs stop running it locally. If E2E takes 90 minutes, devs stop reading the failures. Slow tests get bypassed — speed is correctness.
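
A budget only helps if something checks it. The sketch below compares measured stage durations against the targets from the table; `BUDGET_SECONDS` and `over_budget` are hypothetical names, and how durations are collected from CI is left out as an assumption.

```python
# Targets from the fast-feedback budget, in seconds.
BUDGET_SECONDS = {
    "lint": 30,
    "unit": 2 * 60,
    "integration": 5 * 60,
    "e2e": 15 * 60,
}


def over_budget(durations: dict[str, float],
                budget: dict[str, float] = BUDGET_SECONDS) -> dict[str, float]:
    """Return {stage: seconds over target} for every stage that exceeds its budget."""
    return {
        stage: seconds - budget[stage]
        for stage, seconds in durations.items()
        if stage in budget and seconds > budget[stage]
    }
```

For the two failure cases described above, a 20-minute unit stage and a 90-minute E2E stage, the function reports both as over budget, which is the signal to shard or prune before people start bypassing the suite.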

Shift-right — testing extends into production

Continuous testing doesn't stop at deploy. The complement of shift-left is shift-right: learn from production.

| Practice | What it is | What it catches |
| --- | --- | --- |
| Synthetic monitoring | Automated probes hit production from outside (Pingdom, Datadog, Checkly, Grafana k6 cloud) | Outages, latency regressions, broken third-party integrations, cert expiry |
| Real-user monitoring (RUM) | Browser SDK reports real-user load times, errors, click flows | Browser-specific bugs, slow flows for real users on real networks |
| Canary deployments | Roll new version to 1% → 10% → 50% → 100% over hours/days | Regressions visible at low blast radius before wide rollout |
| Feature flags | Ship dark, enable for a small cohort, then everyone | Test in production safely; instant rollback without redeploy |
| Error tracking | Sentry / Rollbar / Bugsnag capture exceptions with stack + breadcrumbs | Bugs that don't reproduce locally; regressions that escape pre-prod tests |
| Chaos engineering | Deliberate failure injection: kill instances, drop traffic, slow networks | Resilience gaps; recovery timing assumptions |

Feature flags — ship dark, then test

A feature flag decouples deploy from release. Code reaches production behind the flag; the flag stays off until the feature is tested. Switch it on for QA, then internal users, then real users.

deploy (flag off, no behaviour change)
        │
        ▼
flag-on for QA-only cohort   ──────────┐
        │                              │
        ▼                              │
flag-on for 1% of real users           ├──── monitor production
        │                              │
        ▼                              │
flag-on for 100%                       │
        │                              │
        ▼                              │
remove flag from code  ◄───────────────┘

If anything goes wrong at any step, flip the flag off. No deployment rollback, no redeploy.
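
One common way to implement the QA-cohort and percentage steps is deterministic hashing, sketched below. This is an assumption about how a flag could be evaluated, not the behaviour of any specific flag service; `bucket`, `is_enabled`, and the flag name in the usage note are hypothetical.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, percent: int,
               allow_list: frozenset[str] = frozenset()) -> bool:
    """True if the user is allow-listed (e.g. QA) or falls inside the rollout percentage."""
    if user_id in allow_list:
        return True
    return bucket(flag, user_id) < percent
```

With `percent=0` and QA accounts in `allow_list`, only QA sees the feature; raising `percent` to 1 and later 100 widens the cohort without a deploy. Because the bucket is derived from a hash, a user who was enabled at 1% stays enabled at every higher percentage, and setting `percent` back to 0 is the "flag off" switch.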

A/B testing — validate with real users

Run the new version (B) against the old (A) for two cohorts of real users. Compare outcomes:

| What you measure | Example |
| --- | --- |
| Conversion | % completing the funnel |
| Engagement | Time on page, click-through |
| Errors | Crash rate, validation failure rate |
| Performance | LCP, INP, time-to-interactive |

The QA role isn't to pick the winner — it's to make sure the experiment is measurable (instrumentation present, metrics defined, sample size adequate) and that both arms are equally tested before launch.
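
"Sample size adequate" can be made concrete with a standard two-proportion z-test on conversion counts. The sketch below is one possible check, assuming a simple two-arm experiment with independent users; the function name and the idea of using this particular test are assumptions, and real experimentation platforms often apply more careful methods.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates B minus A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one user")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The same relative lift can be convincing with thousands of users per arm and indistinguishable from noise with ten per arm, which is why the instrumentation and sample-size questions belong before launch, not after.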

Agile Testing Metrics

Metrics in Agile aren't management report fodder — they're feedback for the team. Pick a small set and watch the trend, not the absolute number.

| Metric | Definition | Target | Smell when |
| --- | --- | --- | --- |
| Defect density | Defects ÷ stories (or ÷ KLOC) | Trends down sprint over sprint | Spikes, usually from a story too big or AC too thin |
| Escaped defects | Bugs found in production that pre-prod tests missed | As close to 0 as the team can sustain | Trending up: coverage gaps; review post-mortems |
| Defect resolution time | Mean time from "reported" → "fixed and verified" | < 2 days for high-severity | Bugs piling up: WIP-cap them |
| Reopened defect rate | % of defects re-opened after marked "fixed" | < 5% | Fix verifications too shallow; missing regression coverage |
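
These definitions are simple enough to compute from a tracker export. The sketch below assumes a hypothetical `Defect` record with the fields shown; real trackers name and model these differently, and the choice to divide the reopened count by all tracked defects is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Defect:
    reported: datetime
    verified: Optional[datetime] = None  # None while the defect is still open
    reopened: bool = False               # re-opened after being marked fixed
    escaped: bool = False                # first found in production


def escaped_count(defects: list[Defect]) -> int:
    """Number of defects that pre-production testing missed."""
    return sum(d.escaped for d in defects)


def reopened_rate(defects: list[Defect]) -> float:
    """Fraction of tracked defects that were re-opened; 0.0 for an empty list."""
    return sum(d.reopened for d in defects) / len(defects) if defects else 0.0


def mean_resolution_days(defects: list[Defect]) -> Optional[float]:
    """Mean days from reported to fixed-and-verified; None if nothing is resolved."""
    days = [
        (d.verified - d.reported).total_seconds() / 86400
        for d in defects
        if d.verified is not None
    ]
    return mean(days) if days else None
```

Computed per sprint, the three numbers give the trend lines the table asks for without adding manual reporting work.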

Coverage metrics

| Metric | Definition | What it actually tells you |
| --- | --- | --- |
| Acceptance test coverage | % of acceptance criteria with at least one automated test | Confidence the AC won't regress silently |
| Code coverage | % of source lines / branches executed by tests | Useful when trending; useless as an absolute target, because code that is 100% covered can still have untested logic |
| Requirements coverage | % of user stories with at least one test case | Higher level than code coverage; better signal for product completeness |

Code coverage as a target is gameable; as a trend, it's a sensible early warning.

Velocity & process metrics

| Metric | Definition | Tester's read |
| --- | --- | --- |
| Velocity impact | How testing effort affects team velocity per sprint | If velocity drops every time a sprint includes UI testing, the test debt is real |
| Sprint burndown: testing tasks | Testing work as part of the sprint burndown chart | Testing should burn down alongside dev, not stack at the end |
| Stories rolled over | Stories that couldn't be marked "Done" because testing wasn't complete | Persistent rollover means testing capacity is short of dev capacity |
| Cycle time | Time from "in progress" → "done" per story | Includes testing; long cycle times often mean late testing |

Automation metrics

| Metric | Definition | Healthy range |
| --- | --- | --- |
| Automation ratio | Automated tests ÷ total tests | Trending up; the absolute % depends on the product |
| Automation coverage of regression suite | % of regression test cases automated | High; manual regression is the slowest path to release |
| Test execution time | Wall-clock time of the full automated suite | Stable or shrinking; growth past the "fast feedback budget" needs sharding or pruning |
| Flakiness rate | % of automated tests whose result changes on retry without a code change | < 1% per test, < 5% suite-wide. Above that, devs stop trusting CI |
| Test maintenance ratio | Time spent fixing tests ÷ time writing new tests | If fixes dominate, the suite is over-coupled to UI internals; refactor |
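
Suite-wide flakiness can be estimated from CI history. The sketch below treats a test as flaky when the same commit has seen it both pass and fail; that working definition, the `flakiness_rate` name, and the tuple layout of the run records are all assumptions for illustration.

```python
from collections import defaultdict


def flakiness_rate(runs: list[tuple[str, str, bool]]) -> float:
    """Fraction of tests that both passed and failed on at least one commit.

    Each run is (test_name, commit_sha, passed).
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    tests: set[str] = set()
    for test, commit, passed in runs:
        tests.add(test)
        outcomes[(test, commit)].add(passed)
    if not tests:
        return 0.0
    # Two distinct outcomes on one commit means the result changed without a code change.
    flaky = {test for (test, _), seen in outcomes.items() if len(seen) == 2}
    return len(flaky) / len(tests)
```

A test that fails on one commit and passes on the next is not counted, since the code changed in between; only disagreement on the same commit is treated as flakiness.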

Pick the smallest set that drives action

Reporting 12 metrics nobody acts on is a bigger problem than reporting 3 you do. A practical starter dashboard:

  1. Escaped defects this release — the only one product cares about.
  2. CI build time — fast-feedback budget; team productivity.
  3. Flakiness rate — trust in the suite; if it climbs, fix it that sprint.
  4. Stories rolled over due to testing — capacity signal.

Add more only when you have a question those four don't answer.