AI test generation on pull-request

12 min read · Reviewed May 2026 · test-gen

The idea is simple: you open a pull request and an agent reads the diff, writes the tests, and surfaces a suggested test commit alongside your code change. In 2026, the pieces exist. GitHub Copilot Coding Agent is generally available. Cursor background agents are in public preview. Devin is commercially deployed. Claude Code runs in CI. What does not exist yet is a clean, reliable, consistent workflow where you can trust the output without reading it carefully. This guide is a May 2026 snapshot of where that workflow stands, what works, and where it quietly fails.

READ TIME: 12 min
DIFFICULTY: advanced
REVIEWED: May 2026
YOU'LL LEARN: What "AI writes the tests on PR open" actually looks like in 2026, which agents are GA, and where they go wrong.

The shape of the workflow

PR opens → agent reads diff → tests appear in draft PR or comment → human reviewer decides what to keep.

The canonical workflow has four stages. The PR opens. An agent reads the diff and the surrounding code context — the test files for the changed functions, the existing test conventions, the imports and dependencies. The agent scaffolds tests and presents them either as a draft PR commit, a PR comment with code blocks, or a suggested change in the review interface. The human reviewer reads the suggested tests, amends what needs amending, accepts what is correct, and discards what is irrelevant.

The critical step is the last one. This workflow only works if the human reviewer actually reads the test assertions — not just the test names, not just the code coverage percentage, but the specific conditions being tested and the expected values. The risk is that review disciplines built around code review do not automatically extend to assertion review: a test that calls a function and checks that it does not throw is not the same as a test that checks the function produces the right answer.

The workflow is also sensitive to context quality. Agents that can read the full module — including existing tests — produce significantly better suggestions than agents with a narrow diff window. Agents that understand the test framework in use produce idiomatic tests; agents that do not produce structurally valid but stylistically wrong ones that add friction to the review.

[Flow diagram: PR opened (diff + metadata) → Agent reads diff → Tests scaffolded → Human reviewer]
AI-driven test generation on PR — the canonical flow

Vendor landscape, May 2026

Five agents, five different surfaces — from GitHub-native to terminal-native to fully autonomous.

The field is fragmenting by surface rather than by capability. Most of the agents are built on top of frontier models and produce comparable quality output; the differentiation is in where they live and how they integrate into your workflow.

GitHub Copilot Coding Agent is generally available as of late 2025. You assign it an issue; it spawns an environment via GitHub Actions, reads the repository, makes code changes, and opens a draft PR — including tests scaffolded to match the repo's existing test conventions. The quality of test generation correlates strongly with how well-structured your existing test suite is: if your tests follow consistent naming and file organisation conventions, the agent replicates them accurately. See docs.github.com for current feature state.

Cursor background agents (in public preview as of May 2026) run in a cloud environment while you continue coding locally. You assign a task ("write tests for the new UserService methods"), and the agent works asynchronously, surfacing the result when you next check. The integration with your local editor makes reviewing and amending the output faster than a PR-based workflow.

Devin, from Cognition (cognition.ai), is positioned as a fully autonomous software engineering agent. Given a ticket, it produces a complete PR including tests. Its reported SWE-bench results improved through the 2.0 release. Priced at $500 per month and above, it targets teams that want end-to-end delegation rather than a review-and-accept workflow.

Claude Code (Anthropic, claude.com/code) is a terminal-native agent that reads the repository, writes code and tests, and integrates well with MCP server tooling for extended tool access. It does not operate inside GitHub's native review interface but can run in CI via GitHub Actions. Teams comfortable with CLI-first workflows find it well-suited to writing tests with deep codebase context.

| Agent | Trigger | Test output style | Integration depth | Maturity |
| --- | --- | --- | --- | --- |
| GitHub Copilot Coding Agent | Issue assignment | Draft PR: code + tests | GitHub native | GA (late 2025) |
| Cursor background agents | Task description (IDE) | Async test scaffold | Cursor IDE | Public preview |
| Devin (Cognition) | Ticket assignment | Complete PR with tests | GitHub / GitLab | GA ($500+/mo) |
| Claude Code (Anthropic) | CLI instruction / CI | Code + tests in context | Terminal / GitHub Actions | GA |
| Copilot CLI agent | gh agent-task create | Code + tests | GitHub CLI | Public preview (May 2026) |

AI test-generation tools, May 2026 — the field is moving fast

What AI-generated tests actually look like

Scaffolding and happy paths, yes. Meaningful assertions on edge cases — less reliably.

AI-generated tests are reliable at two things: scaffolding the test structure (imports, describe blocks, setup boilerplate, mock configuration) and covering the obvious happy path (the function is called with valid input and returns a valid result). Both are genuinely useful: writing boilerplate is tedious, and scaffolding removes the friction that causes developers to skip tests entirely.

Where the quality degrades is in assertion content. The most common failure mode is the tautological assertion: a test that constructs expected output by calling the same function under test, then asserts that calling the function again produces the same result. This will always pass, including when the function is broken. It gives you a green CI light and zero coverage of anything that matters.

The second failure mode is coverage without verification: the test calls a method, checks that it does not throw, and declares victory. The lines it executes count as covered in the coverage report, yet nothing about the method's output is verified, even for the nominal case.

Both failure modes are visible to a careful human reviewer. Neither is obvious if you are reviewing at speed. The review discipline required to catch them is the same discipline required to write good tests in the first place — which means the agent reduces the writing cost but not the thinking cost.

// WARNING

Auto-generated tests can pass while testing the wrong thing. A test that calls assertEquals(produce(x), produce(x)) will always pass. Every PR with generated tests needs human verification that the assertion is meaningful, not just present.

The review discipline that makes it work

Generated tests are a draft, not a decision — the assertion review is the part the agent cannot do for you.

The teams getting value from AI test generation treat generated tests as a first draft requiring substantive review, not as a complete artefact requiring approval. This is a meaningful cultural shift. The instinct when AI suggests something competent-looking is to accept it; the right response is to read the assertions as carefully as you would read a junior engineer's test commit.

Mutation testing is the fastest way to find tautological tests: run your test suite against intentionally broken versions of your code (mutants). A test that passes against broken code is not testing what it claims to test. Tools like Stryker (JavaScript), PITest (Java), and mutmut (Python) can run in CI and report the mutants that survive, which makes the problem visible without requiring individual review of each assertion.
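The idea can be shown without any tool. The Python sketch below is hand-rolled and is not how Stryker, PITest, or mutmut work internally: those tools generate mutants automatically, whereas here the mutants are written by hand. All names (`is_adult`, `MUTANTS`, `surviving_mutants`) are hypothetical.

```python
from typing import Callable

Impl = Callable[[int], bool]


def is_adult(age: int) -> bool:
    """Original implementation (hypothetical example)."""
    return age >= 18


# Hand-written mutants; real tools generate these automatically.
MUTANTS: dict[str, Impl] = {
    "boundary >= to >": lambda age: age > 18,
    "constant 18 to 19": lambda age: age >= 19,
    "always True": lambda age: True,
}


def weak_test(impl: Impl) -> None:
    impl(30)  # calls the code but asserts nothing


def strong_test(impl: Impl) -> None:
    assert impl(18) is True   # boundary value
    assert impl(17) is False  # just below the boundary


def surviving_mutants(test: Callable[[Impl], None]) -> list[str]:
    """Names of mutants the test fails to detect (it passes against them)."""
    survivors = []
    for name, mutant in MUTANTS.items():
        try:
            test(mutant)
        except AssertionError:
            continue  # the test failed, so the mutant is killed
        survivors.append(name)
    return survivors
```

Here `surviving_mutants(weak_test)` returns all three mutant names, while `surviving_mutants(strong_test)` returns an empty list. Survivors point directly at the assertions that need a human rewrite.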

The practical review checklist: Does the test fail when you break the code? Does it test edge cases, or only the happy path? Does the assertion check the right thing — the output value, not just the absence of an exception? Is the test isolated, or does it depend on the success of a previous test in the same file?

An AI-generated test you didn't read is worse than no test — it gives you the false security of a green CI.

What's next in this space

Test-gen-only products are being absorbed; the coding agents are converging on "write tests with the change" as the default.

The pattern emerging in 2026 is that test generation is becoming a feature of coding agents rather than a standalone product category. Purpose-built test-generation tools that do not have a broader coding agent offering are being squeezed: Copilot, Cursor, Claude Code, and Devin all produce tests as part of producing code, and their test quality improves as their code quality improves.

The convergence is toward a workflow where "write tests" is not a separate instruction but a default behaviour: any code change the agent makes includes tests by default, without the developer asking. This is the direction the major coding agents are moving. The open question is quality — whether the default tests are substantively correct or tautologically green.

This guide is a May 2026 snapshot. The product surface is moving quickly; the evaluation discipline described above is not. Whether an agent writes a test for you or you write it yourself, the obligation to verify the assertion content is permanent.
