Project Brief — Build an AI-Augmented QA Workflow

You have spent four chapters mapping the landscape, the categories, the tools, and the workflows. The capstone pulls everything together into a single realistic exercise: design and roll out an AI-augmented QA practice for a working team. The brief below is the project. The walkthrough in the next lesson is one credible answer; yours can differ — the point is that you can defend your choices.

The scenario — FlexBank

You are the QA Lead at FlexBank, a mid-sized digital bank. Your team is eight QA engineers. The current setup:

A Selenium + TestNG suite of around 1,500 tests covering the web app.
Postman collections for API testing, around 400 requests.
Manual exploratory testing on each release.
One manual visual review per release ("does the UI look right on Chrome and Safari?").

The CTO has been hearing about AI in QA for two years and finally wants the team to do something about it. She has approved a $5,000/month budget for AI tooling and asked for a 90-day pilot with concrete results to report to the board.

The pain points (verbatim from your stakeholders)

"Test authoring is slow — about three days per new feature, and we ship every two weeks."
"Bug triage is two weeks behind. We have a backlog of 80 unread reports."
"8% of our CI runs fail on flaky tests. Engineers re-run by reflex."
"Manual exploratory testing on each release takes three days from two senior testers."
"Visual regressions slip into production occasionally — we missed a broken checkout button last month for 11 hours."

These are the pain points the CTO will measure you against. Whatever you adopt should move at least three of these numbers in 90 days, or the pilot fails.

What you will deliver

Six artefacts. The walkthrough lesson works through one credible answer for each — yours should be specific to your real team if you have one, or to FlexBank if you're using this as a pure exercise.

1. Tool selection document

For each of the six tool categories from Chapter 1, decide:

Coding assistant. GitHub Copilot vs Cursor — pick one, give the rationale.
AI chat assistants. ChatGPT Team vs Claude Team vs both — for whom and at what tier.
Self-healing or AI test platform. Mabl trial vs Healenium open-source — which to pilot first.
Visual AI tool. Applitools trial vs Percy — which fits FlexBank's surface area.
MCP-based exploration. Playwright MCP is free; what would you actually use it for at FlexBank?
AI analysis tools. Datadog Watchdog or similar — in scope or out of scope for this pilot?

For each: tool, cost, what it covers, what it doesn't, expected impact on the FlexBank pain points. One paragraph each, plus a one-page summary table.
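The cost column of the summary table can be checked against the budget with a few lines of code. The following is a minimal sketch, not a prescribed format: the `ToolLine` fields and the `budget_headroom` helper are hypothetical names, and every price has to come from a vendor quote rather than from this brief. Only the $5,000 ceiling is taken from the scenario.

```python
from dataclasses import dataclass

MONTHLY_BUDGET_USD = 5_000  # ceiling stated in the brief, includes LLM API costs


@dataclass(frozen=True)
class ToolLine:
    """One row of the summary table (hypothetical structure)."""
    name: str
    per_seat_usd: float = 0.0  # monthly price per seat, from a vendor quote
    seats: int = 0
    flat_usd: float = 0.0      # monthly platform fee or estimated usage spend

    @property
    def monthly_usd(self) -> float:
        return self.per_seat_usd * self.seats + self.flat_usd


def budget_headroom(lines: list[ToolLine],
                    budget_usd: float = MONTHLY_BUDGET_USD) -> float:
    """Budget left after all tool lines; a negative value means over budget."""
    return budget_usd - sum(line.monthly_usd for line in lines)
```

Keeping per-seat and flat costs separate makes it easier to see which lines grow with team size and which grow with usage.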

2. Pilot plan — three 30-day phases

A 90-day rollout structured as three phases. The structure that works for most teams:

90-day pilot structure

Phase 1 (days 1-30)
– Adopt coding assistants for the whole team
– Roll out chat AI for design and triage
– Capture baseline metrics
– Document team norms

Phase 2 (days 31-60)
– Self-healing on legacy Selenium suite
– Visual AI on marketing + checkout pages
– MCP-driven exploratory loop on 2 engineers
– Measure each pilot weekly

Phase 3 (days 61-90)
– Roll winning tools out to whole team
– Drop tools that didn't deliver
– Publish prompt library and norms
– Report to CTO with recommendations

3. Success metrics

Five metrics, with baselines and 90-day targets. Realistic targets — overpromising is how pilots fail to land:

Test authoring time per new feature. Baseline: 3 days. Target: ?
Bug triage cycle time. Baseline: 2 weeks behind. Target: ?
Flake rate. Baseline: 8%. Target: ?
Engineer satisfaction. Baseline: TBD survey. Target: ?
Cost per test execution. Baseline: ~$0 (Selenium on owned infra). Target: ?

The cost-per-test number matters because some AI tools (managed platforms, LLM tokens for MCP) have variable cost — and the CTO will ask.
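Two of these metrics need a definition before they can have a baseline. The sketch below shows one possible way to compute them, under assumptions the brief does not make: a CI run counts as flaky when its first attempt failed and a re-run of the same commit passed, and cost per execution counts only variable AI spend. The `CiRun` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CiRun:
    commit_sha: str
    attempt: int   # 1 = first run, 2+ = re-runs of the same commit
    passed: bool


def flake_rate(runs: list[CiRun]) -> float:
    """Share of first attempts that failed but passed on a re-run (assumed definition)."""
    first_attempts = [r for r in runs if r.attempt == 1]
    if not first_attempts:
        return 0.0
    rerun_passed = {r.commit_sha for r in runs if r.attempt > 1 and r.passed}
    flaky = sum(
        1 for r in first_attempts
        if not r.passed and r.commit_sha in rerun_passed
    )
    return flaky / len(first_attempts)


def cost_per_execution(variable_spend_usd: float, executions: int) -> float:
    """Variable AI spend (tokens, metered platform fees) per test execution."""
    if executions <= 0:
        raise ValueError("executions must be positive")
    return variable_spend_usd / executions
```

Whatever definition you choose, fix it before capturing the baseline so the day-90 number is measured the same way.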

4. Training plan

How will you onboard the team to the new tools? At minimum:

Week 1 launch session: live walkthrough of each tool, common usage patterns, governance norms.
Weeks 2-4: paired-prompting sessions — engineers prompt together, share what worked.
Week 5+: an internal "prompt library" repo. Each engineer contributes at least three prompts they use weekly.
Monthly: a 30-minute team retro on AI tool usage. What's working? What's not?

5. Governance

A one-page document covering:

Data privacy. What code, test data, customer data may be sent to AI providers? What enterprise tiers are required to make sensitive data acceptable?
Code review. AI-generated code is reviewed exactly like human-written code. No special bypass.
Prompt library. Where it lives, who curates, how prompts are added.
Reproducibility. When does AI output need to be deterministic (e.g., generated fixtures), and how do you pin model versions where needed?
Incident handling. If an AI-generated test masks a bug that ships to production, what's the response?

Keep it short and agree on it once; it prevents many awkward conversations later.
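For the reproducibility item, one common approach is to generate a fixture once, commit it, and reuse it from then on, keyed by the pinned model identifier and the exact prompt. The sketch below illustrates that idea only. The `generate` callable stands in for whatever provider client your team uses, and the file layout and function names are assumptions, not part of any tool named in this brief.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(model_id: str, prompt: str) -> str:
    """Stable key: the same pinned model and prompt always map to the same file."""
    payload = json.dumps({"model": model_id, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_generate(cache_dir: Path, model_id: str, prompt: str,
                     generate: Callable[[str, str], str]) -> str:
    """Return the stored fixture if present; otherwise generate once and store it."""
    path = cache_dir / f"{cache_key(model_id, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["output"]
    output = generate(model_id, prompt)  # hypothetical provider call
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"model": model_id, "prompt": prompt, "output": output}
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return output
```

Because the model identifier is part of the key, changing the pinned version produces a new file, and the fixture change shows up in code review like any other diff.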

6. Risk mitigations

Four risks the CTO will ask about:

Vendor lock-in. Mitigation: prefer tools you can leave (Healenium open-source over Mabl) where possible; for managed platforms, demand export options at sign-up.
LLM API costs scaling unexpectedly. Mitigation: per-engineer monthly cap, cost dashboard, alerts at 50/80/100% of budget.
Team resistance / fear of AI. Mitigation: frame AI as augmentation, not replacement; celebrate AI power users; address job-security concerns directly and honestly.
Quality concerns — AI bugs in production. Mitigation: human review of all AI-generated code, baseline metrics on defect-escape rate, rollback plan if escape rate worsens.
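The cost mitigation above (per-engineer cap, alerts at 50/80/100% of budget) can be sketched in a few lines. This is a minimal illustration under assumptions: spend figures are presumed to come from your provider's billing export, and the function names are hypothetical.

```python
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50/80/100% of budget, as in the mitigation


def crossed_thresholds(spend_usd: float, budget_usd: float,
                       thresholds: tuple[float, ...] = ALERT_THRESHOLDS) -> list[int]:
    """Return the alert levels, as whole percentages, that spend has reached."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spend_usd / budget_usd
    return [round(t * 100) for t in thresholds if ratio >= t]


def over_cap(spend_by_engineer: dict[str, float], cap_usd: float) -> list[str]:
    """Names of engineers whose month-to-date spend has reached the cap."""
    return sorted(
        name for name, spend in spend_by_engineer.items() if spend >= cap_usd
    )
```

A scheduled job could call `crossed_thresholds` daily and notify only when a new level appears, which avoids repeating the same alert every day.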

The constraints

Budget: $5,000/month maximum, including LLM API costs.
Timeline: 90 days, with mid-pilot checkpoints at day 30 and day 60.
Team size: 8 QA engineers, including you. No new hires for the pilot.
Existing infra: Selenium + TestNG, Postman, internal CI on GitHub Actions.
Stakeholders: CTO (sponsor), VP Engineering (skeptic), three engineering team leads (mixed).

What "good" looks like at day 90

At least three of the five pain-point metrics have improved measurably.
Tools that didn't deliver have been dropped without drama.
The team has clear written norms and a prompt library.
Engineers are using AI tools daily, by choice, not by mandate.
The CTO has a one-page summary of what worked, what didn't, and what to do next.
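The first criterion is easier to defend if "improved measurably" is defined up front. The sketch below is one possible definition, not the brief's: a metric counts as improved when it moves at least 10% in the right direction relative to its baseline. The 10% bar, the `Metric` record, and the function names are all assumptions to adjust with your sponsor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    baseline: float
    current: float
    lower_is_better: bool = True
    min_relative_change: float = 0.10  # assumed bar for "measurably"


def improved(m: Metric) -> bool:
    """True if the metric moved far enough in the right direction."""
    if m.baseline == 0:
        return False  # relative change is undefined against a zero baseline
    change = (m.baseline - m.current) / abs(m.baseline)
    if not m.lower_is_better:
        change = -change
    return change >= m.min_relative_change


def pilot_passes(metrics: list[Metric], required: int = 3) -> bool:
    """True if at least `required` metrics improved."""
    return sum(improved(m) for m in metrics) >= required
```

Agreeing on the threshold at day 0 keeps the day-90 conversation about results rather than about definitions.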

How to use this brief

If you're working in a real team, adapt FlexBank's setup to your reality. Replace Selenium with Cypress or Playwright. Replace Postman with REST Assured. The structure of the exercise stays the same.
If you're studying solo, treat FlexBank as a thought experiment. Write the artefacts as if you were going to ship them. Show them to a colleague for a sanity check.
Either way: spend more time on the rationale than on the tool selection itself. The CTO already trusts that there are tools — what she wants from you is judgement.

The next lesson walks through one credible answer to all six artefacts, with the trade-offs explicit. Read it after you've drafted your own answer — comparing the two is where the learning lives.