Guided Walkthrough — Tool Selection, Integration, Measurement

This is one credible answer to the FlexBank brief. It is one answer, not the answer — your reasoning should reflect your team's reality. The point is to show the trade-offs explicitly so you can argue with the choices and adapt them. Read this after you've drafted your own; the gap between the two is where the learning lives.

Part 1 — Tool selection rationale

A defensible budget breakdown for the $5,000/month constraint.

Coding assistant — Cursor for the whole team

Pick: Cursor Business at $40/user/month × 8 = $320/month.

Why Cursor over GitHub Copilot for FlexBank specifically: the team is refactoring a 1,500-test legacy Selenium suite. That work benefits from project-wide context (Cursor's Cmd+L with @-references) far more than from inline autocomplete. For a greenfield project, Copilot at half the price would be the right call; for a refactor-heavy team, the extra $20/user/month is the cheapest productivity win in this list.

Risk: editor migration friction. Mitigation: Cursor is VS Code-compatible, so settings and extensions carry over. Pilot on three engineers in week 1 before rolling to all eight.

Chat AI — split tier by role

Pick: Claude Team at $30/user/month × 3 leads = $90/month, plus ChatGPT Plus at $20/user/month × 5 engineers = $100/month. Total: $190/month.

Why split: QA leads spend more time on long-form artefacts — test plans, triage analyses across 50 reports, codebase-wide reasoning — where Claude's long-context handling helps. Engineers spend more time on shorter-form work where ChatGPT's broader plugin ecosystem and lower cost are fine. Both have data-handling terms that work for FlexBank's banking-domain code.

Risk: tool sprawl. Mitigation: prompt library is shared across both; engineers can use either at their preference if a particular task fits one better.

Self-healing — Healenium first

Pick: Healenium open-source. $0 in licence. Operational overhead: ~4 hours/week of one engineer for the pilot.

Why Healenium over Mabl: FlexBank already has a working Selenium suite. Healenium wraps the existing driver — no rewrites. Mabl would mean re-authoring tests in a new platform and migrating CI; that's a separate, larger project. If Healenium fails to move the maintenance metric meaningfully in 60 days, then trial Mabl in phase 3.

Risk: open-source operational burden. Mitigation: budget the 4 hours/week explicitly; if it grows beyond that, it's a signal to evaluate Mabl on cost-of-ownership terms, not just sticker price.

Visual AI — Applitools 30-day trial

Pick: Applitools 30-day free trial in phase 2. If results justify it, contract at ~$2,000/month.

Why Applitools over Percy: FlexBank's pain point includes a missed visual regression in production. Applitools has the strongest cross-browser and cross-device handling, which is what a banking app on Chrome / Safari / iPhone / Android browser needs. Percy is a viable cheaper alternative if budget tightens; for the pilot, run Applitools and measure real bugs caught.

Risk: cost shock at end of trial. Mitigation: explicit go/no-go criterion before phase 3 — at least three real visual regressions caught in 30 days, or no contract.
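
One way to keep that criterion honest is to write it down as code before the trial starts, so nobody relitigates the threshold on day 29. The sketch below is illustrative only: the VisualFinding record and the applitools_go function are hypothetical names, not part of any Applitools API, and it assumes a human has triaged each finding as real or noise.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MIN_REAL_REGRESSIONS = 3   # threshold from the go/no-go criterion above
TRIAL_DAYS = 30            # length of the free trial


@dataclass(frozen=True)
class VisualFinding:
    """Hypothetical record of one diff flagged during the trial."""
    found_on: date
    confirmed_real: bool   # True only after a human confirms a genuine regression


def applitools_go(findings: list[VisualFinding], trial_start: date) -> bool:
    """Return True if enough confirmed regressions fell inside the trial window."""
    trial_end = trial_start + timedelta(days=TRIAL_DAYS)
    real_in_window = [
        f for f in findings
        if f.confirmed_real and trial_start <= f.found_on < trial_end
    ]
    return len(real_in_window) >= MIN_REAL_REGRESSIONS


if __name__ == "__main__":
    start = date(2025, 3, 1)
    findings = [
        VisualFinding(date(2025, 3, 5), True),
        VisualFinding(date(2025, 3, 12), False),  # noise: does not count
        VisualFinding(date(2025, 3, 20), True),
    ]
    print("Contract Applitools:", applitools_go(findings, start))
```

Counting only human-confirmed findings matters here: raw diff counts would reward a noisy tool.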

MCP-based exploration — Playwright MCP

Pick: Playwright MCP open-source. Cost: free, plus LLM API tokens budgeted at $500/month.

Why: FlexBank does three days of manual exploratory testing per release. Two senior engineers using a Playwright-MCP-driven exploratory loop alongside their manual session can plausibly cut that to two days within 60 days. The free tooling and modest token cost make this the highest-ROI experiment in the pilot.

Risk: token costs running over. Mitigation: per-session cap of 200K tokens, weekly cost report, soft alert at $400.
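
The mitigation above can be sketched as a small guard that sits between the exploratory loop and the LLM API. This is a minimal sketch, not a real client wrapper: the TokenBudget class is hypothetical, and the price per million tokens is an assumed placeholder you would replace with your provider's actual rates.

```python
from dataclasses import dataclass

SESSION_TOKEN_CAP = 200_000      # per-session cap from the mitigation above
MONTHLY_BUDGET_USD = 500.0       # token budget from the pick above
SOFT_ALERT_USD = 400.0           # soft-alert threshold from the mitigation above
# ASSUMPTION: a blended price, for illustration only. Use your provider's real rates.
ASSUMED_USD_PER_MILLION_TOKENS = 10.0


@dataclass
class TokenBudget:
    """Hypothetical tracker for monthly spend and per-session token use."""
    spent_usd: float = 0.0
    session_tokens: int = 0

    def start_session(self) -> None:
        self.session_tokens = 0

    def record(self, tokens: int) -> str:
        """Record usage and return 'ok', 'soft_alert', 'over_budget' or 'session_cap'."""
        if self.session_tokens + tokens > SESSION_TOKEN_CAP:
            return "session_cap"  # request refused; nothing is recorded
        self.session_tokens += tokens
        self.spent_usd += tokens * ASSUMED_USD_PER_MILLION_TOKENS / 1_000_000
        if self.spent_usd >= MONTHLY_BUDGET_USD:
            return "over_budget"
        if self.spent_usd >= SOFT_ALERT_USD:
            return "soft_alert"
        return "ok"


if __name__ == "__main__":
    budget = TokenBudget()
    budget.start_session()
    print(budget.record(150_000))  # within the session cap
    print(budget.record(60_000))   # would exceed 200K tokens in this session
```

The weekly cost report is then just spent_usd printed once a week; the point is that the cap is enforced in code rather than by good intentions.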

AI analysis tools — out of scope for this pilot

Pick: defer. FlexBank's observability sits with the SRE team; pulling AI analysis tools into the QA pilot widens the scope beyond what 90 days can credibly deliver. Recommend revisiting in the next pilot.

Total monthly spend

Item                                    Cost
Cursor Business × 8                     $320
Claude Team × 3 + ChatGPT Plus × 5      $190
Healenium                               $0 (operational time)
Applitools (phase 2 onwards)            $0 trial → ~$2,000 if contracted
Playwright MCP + tokens                 $500
Total before Applitools contract        ~$1,010
Total with Applitools contract          ~$3,010

Comfortably within the $5,000 budget, with headroom for a second pilot tool in phase 3 if the data supports it.
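
If you adapt the picks to your own team, it helps to keep the arithmetic somewhere it can be re-run. The sketch below recomputes the totals above from seat counts and list prices taken from this walkthrough; the LineItem class and variable names are illustrative only.

```python
from dataclasses import dataclass

BUDGET_USD = 5_000  # monthly constraint from the brief


@dataclass(frozen=True)
class LineItem:
    name: str
    unit_cost: int     # USD per seat (or flat fee) per month
    quantity: int = 1

    @property
    def monthly(self) -> int:
        return self.unit_cost * self.quantity


BASE_ITEMS = [
    LineItem("Cursor Business", 40, 8),
    LineItem("Claude Team", 30, 3),
    LineItem("ChatGPT Plus", 20, 5),
    LineItem("Healenium", 0),                 # licence cost only; engineer time not priced
    LineItem("Playwright MCP tokens", 500),
]
APPLITOOLS = LineItem("Applitools (if contracted)", 2_000)


def total(items: list[LineItem]) -> int:
    return sum(item.monthly for item in items)


base_total = total(BASE_ITEMS)
contracted_total = total(BASE_ITEMS + [APPLITOOLS])
headroom = BUDGET_USD - contracted_total

if __name__ == "__main__":
    for item in BASE_ITEMS + [APPLITOOLS]:
        print(f"{item.name:<30} ${item.monthly:>6,}")
    print(f"{'Total before Applitools':<30} ${base_total:>6,}")
    print(f"{'Total with Applitools':<30} ${contracted_total:>6,}")
    print(f"{'Headroom under budget':<30} ${headroom:>6,}")
```

Note that the Healenium line prices the licence only. The 4 hours/week of engineer time is a real cost that this sketch deliberately leaves out.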

Part 2 — Phase 1 (days 1-30): Foundation

  • Day 1-3. Procurement: licences for Cursor, Claude, ChatGPT. Onboard accounts.
  • Day 4-5. Baseline metric capture: time per new test (sample 5 recent features), flake rate (last 100 CI runs), triage backlog, exploratory testing duration, and visual regressions that reached production (last 90 days).
  • Day 6-7. Launch session: 90-minute team session walking through each tool, governance norms, the prompt library template.
  • Day 8-21. Daily use. Engineers work in pairs for the first week: one drives prompts, the other observes, then they switch.
  • Day 22-28. First retro. What's working? What's not? Adjust norms. Add 5-10 prompts to the library.
  • Day 30 checkpoint. Time-per-new-test: target under 2 days (33% reduction). Triage backlog: target halved.
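
The baseline and the day-30 checkpoint above can be captured in a few lines so the comparison is mechanical rather than remembered. This is a minimal sketch under stated assumptions: the run records and dictionary keys are hypothetical, and a run counts as "flaky" only if your CI already labels it that way (for example, failed then passed on retry with no code change).

```python
def flake_rate(runs: list[dict], window: int = 100) -> float:
    """Fraction of the most recent `window` CI runs that were marked flaky."""
    recent = runs[-window:]
    if not recent:
        raise ValueError("no CI runs to sample")
    return sum(1 for run in recent if run["flaky"]) / len(recent)


def day30_checkpoint(baseline: dict, current: dict) -> dict[str, bool]:
    """Check the two day-30 targets: under 2 days per new test, backlog halved."""
    return {
        "time_per_new_test": current["days_per_new_test"] < 2.0,
        "triage_backlog": current["triage_backlog"] <= baseline["triage_backlog"] / 2,
    }


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    runs = [{"flaky": i < 8} for i in range(100)]
    print(f"Baseline flake rate: {flake_rate(runs):.0%}")

    baseline = {"days_per_new_test": 3.0, "triage_backlog": 40}
    current = {"days_per_new_test": 1.9, "triage_backlog": 22}
    print(day30_checkpoint(baseline, current))
```

In the illustrative numbers, the time target is met but the backlog (22 against a target of 20 or fewer) is not, which is exactly the kind of result the first retro should discuss.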

Part 3 — Phase 2 pilots (days 31-60)

Three parallel experiments, each owned by a different engineer:

  • Healenium pilot. Convert 100 of the most brittle Selenium tests to Healenium-wrapped. Measure: maintenance hours per week before vs after, healing-report accuracy, false-heal rate.
  • Applitools pilot. Add visual checks to homepage, login, and the four highest-traffic pages on FlexBank web. Measure: visual regressions caught in dev / staging that would have escaped to production.
  • Playwright MCP exploratory loop. Two senior engineers use MCP-driven exploration on each release alongside their manual session. Measure: bugs found via MCP that the manual pass missed, time saved, token cost.

Weekly check-ins with each pilot owner. The metric, not the activity, is what's reviewed.
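
For the Healenium pilot, two of the measures above reduce to simple ratios once an engineer has reviewed each heal. The sketch below assumes a hypothetical HealEvent record filled in by hand during review; it does not read Healenium's own report format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealEvent:
    """Hypothetical review record: one locator heal, judged by an engineer."""
    test_name: str
    heal_was_correct: bool  # False means the healed locator hit the wrong element


def false_heal_rate(events: list[HealEvent]) -> float:
    """Share of heals that a reviewer judged incorrect (0.0 if there were none)."""
    if not events:
        return 0.0
    wrong = sum(1 for event in events if not event.heal_was_correct)
    return wrong / len(events)


def maintenance_reduction_pct(hours_before: float, hours_after: float) -> float:
    """Percent drop in weekly maintenance hours; negative means it got worse."""
    if hours_before <= 0:
        raise ValueError("baseline hours must be positive")
    return (hours_before - hours_after) * 100 / hours_before


if __name__ == "__main__":
    # Hypothetical review data, for illustration only.
    events = [
        HealEvent("login_happy_path", True),
        HealEvent("transfer_funds", True),
        HealEvent("statement_download", False),
        HealEvent("profile_update", True),
    ]
    print(f"False-heal rate: {false_heal_rate(events):.0%}")
    print(f"Maintenance reduction: {maintenance_reduction_pct(10, 6):.0f}%")
```

A false heal is worse than a failed test, because the suite goes green while checking the wrong element. That is why this rate is reviewed alongside the hours saved rather than after it.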

Part 4 — Phase 3 scale (days 61-90)

  • Healenium. If maintenance burden dropped meaningfully, expand to all 1,500 tests. If not, retire the pilot and document why.
  • Applitools. If at least three real visual regressions were caught in the trial (the go/no-go criterion set in Part 1), contract at ~$2,000/month. If fewer, drop Applitools or downscale to Percy at lower cost.
  • MCP loop. If engineers found value, document patterns and roll to all senior engineers. Draft a "FlexBank exploratory testing playbook" combining manual + MCP.
  • Prompt library. Lock down ownership. Quarterly review cadence agreed.
  • Day 90 report to CTO. One page. What worked, what didn't, what to do next, total spend, projected ongoing spend.

Part 5 — The 90-day timeline

Days 1-7 — Setup and baseline

Procure tools, onboard accounts, capture baseline metrics, run launch session.

Part 6 — Measurement cadence

  • Weekly: dashboard with the five metrics. 15-minute team standup item.
  • Monthly: retrospective. What surprised us? What's not moving?
  • Day 30, 60, 90: formal checkpoints with written summary to the CTO.

The discipline is unsexy but matters. Pilots that fail usually fail at the measurement step — tools were adopted, dashboards weren't built, nobody can tell at the end what changed.

Part 7 — Realistic 90-day outcomes

Honest expectations for FlexBank:

  • Time per new test: 3 days → 1.5 days. (50% reduction is plausible with coding assistants alone.)
  • Triage backlog: 2 weeks → 3 days behind. (AI categorisation + dedup compounding.)
  • Flake rate: 8% → 4%. (AI-assisted root-cause analysis on the worst 30 flakes.)
  • Exploratory testing duration: 3 days → 2 days. (MCP loop covers some of what manual covered.)
  • Visual regressions to production: 1 missed in last 90 days → 0 in next 90 days. (Applitools coverage on the highest-traffic surfaces.)

If the pilot delivers most of those, it's a clear success. If it delivers fewer than three, the CTO has a basis for harder decisions in quarter four — which is also a useful outcome.
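
The success rule above is simple enough to encode directly, which is a useful check that the day-90 report cannot be argued either way. A minimal sketch follows: the metric keys are illustrative names, the targets are the ones listed above, and "most" is read as at least three of the five.

```python
# Day-90 targets from the list above; lower is better for every metric.
TARGETS = {
    "days_per_new_test": 1.5,
    "triage_backlog_days": 3,
    "flake_rate_pct": 4,
    "exploratory_days": 2,
    "visual_regressions_to_prod": 0,
}
MIN_HITS_FOR_SUCCESS = 3  # ASSUMPTION: "most of those" means at least 3 of 5


def pilot_verdict(actuals: dict[str, float]) -> tuple[str, list[str]]:
    """Return the verdict and the list of metrics that met their target."""
    hits = [name for name, target in TARGETS.items() if actuals[name] <= target]
    if len(hits) >= MIN_HITS_FOR_SUCCESS:
        return "clear success", hits
    return "basis for harder decisions", hits


if __name__ == "__main__":
    # Hypothetical day-90 measurements, for illustration only.
    actuals = {
        "days_per_new_test": 1.5,
        "triage_backlog_days": 5,
        "flake_rate_pct": 4,
        "exploratory_days": 2.5,
        "visual_regressions_to_prod": 0,
    }
    verdict, hits = pilot_verdict(actuals)
    print(verdict, hits)
```

A near miss (backlog at 5 days instead of 3) still counts as a miss here. If you want partial credit, decide how to score it before day 90, not after.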

What's different about this pilot from a typical AI rollout

Three things worth flagging:

  • It starts with metrics, not tools. Most failed AI pilots start with "we bought Mabl, what should we do with it." This one starts with "here are five pain points; which tool moves which one."
  • It has explicit kill criteria. Each pilot has a go/no-go threshold. Tools that don't deliver get dropped without drama. Sunk cost is the enemy.
  • It treats AI as a workflow change, not a tool purchase. The training plan, governance, and prompt library matter as much as the licences. Teams that skip these end up with great tools nobody uses.

In the final lesson, you'll review whether your version of this pilot would actually work — and look at where AI in QA is heading next.
