Synthetic test data with LLMs
Production data is sensitive, manually crafted data misses edge cases, and LLM-generated data is "easy" but rarely reproducible without discipline. The interesting question is not whether to generate synthetic data — it is which tool fits which use case and how to keep generations deterministic. Four approaches have carved out defensible niches: schema-first generators like Mockaroo with its native AI generation, production-data synthesisers like Tonic.ai, statistical modellers like the SDV library from DataCebo (sdv.dev), and hand-rolled LLM prompts. Each has a specific sweet spot, and knowing which to reach for before you start is the difference between a one-hour win and a three-day debugging exercise.
The generation flow
Four stages from prompt to versioned fixture — reproducibility is designed in, not added afterwards.
A reliable synthetic-data generation pipeline is not a single LLM call. It is four stages: a prompt that carries the schema and a seed value, an LLM or dedicated generator that produces structured output, a schema-validation step that rejects non-conforming output before it reaches the test suite, and a versioning step that tags the output so tests can pin to a known-good fixture set.
The validation step is the most commonly skipped and the most consequential to skip. Without it, schema drift between the LLM output and the test fixture expectation accumulates silently — the generated data looks plausible, the test imports it, and the assertion fails three layers deep with an error message that points nowhere useful.
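A minimal sketch of such a validation gate, using only the Python standard library. It assumes a user-record shape with `id`, `email`, `postcode`, `dob`, and `tier` fields; the function names, the simplified postcode pattern, and the email pattern are illustrative assumptions, not part of any tool discussed here.

```python
import re
import uuid
from datetime import date

# Hypothetical rules for a user record; adjust to your own fixture schema.
REQUIRED_FIELDS = {"id", "email", "postcode", "dob", "tier"}
ALLOWED_TIERS = {"bronze", "silver", "gold"}
# Simplified UK postcode shape check (assumption: not the full official rule set).
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")
# Deliberately loose email shape check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    try:
        uuid.UUID(str(record["id"]))
    except ValueError:
        problems.append(f"id is not a UUID: {record['id']!r}")
    if not EMAIL_RE.match(str(record["email"])):
        problems.append(f"email looks malformed: {record['email']!r}")
    if not POSTCODE_RE.match(str(record["postcode"])):
        problems.append(f"postcode is not UK-shaped: {record['postcode']!r}")
    try:
        date.fromisoformat(str(record["dob"]))
    except ValueError:
        problems.append(f"dob is not an ISO 8601 date: {record['dob']!r}")
    if str(record["tier"]) not in ALLOWED_TIERS:
        problems.append(f"tier not allowed: {record['tier']!r}")
    return problems


def validate_batch(records: list[dict], expected_count: int) -> None:
    """Raise ValueError listing every problem, so a bad batch never reaches the tests."""
    errors = []
    if len(records) != expected_count:
        errors.append(f"expected {expected_count} records, got {len(records)}")
    seen_ids = set()
    for position, record in enumerate(records):
        errors.extend(f"record {position}: {p}" for p in validate_record(record))
        record_id = str(record.get("id"))
        if record_id in seen_ids:
            errors.append(f"record {position}: duplicate id {record_id}")
        seen_ids.add(record_id)
    if errors:
        raise ValueError("; ".join(errors))
```

Failing at this point means the error message names the offending record and field, rather than surfacing later as an unrelated assertion failure.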
Vendor landscape, May 2026
Schema-first, production-synthesised, statistical, and hand-rolled — four approaches with distinct sweet spots.
The table below covers the four main approaches to synthetic test data generation in 2026. This is a practical comparison of the approaches QA practitioners are most likely to encounter when setting up a test data pipeline, not a comprehensive market survey.
Gretel (acquired by NVIDIA — now part of NVIDIA's synthetic data tooling) was formerly a standalone vendor. The SDV open-source library (sdv.dev) is maintained by DataCebo, which publishes SDV Enterprise for teams requiring scale and support; a DataCebo case study documents ING Belgium achieving 100× test coverage using SDV Enterprise.
| Approach | What it offers | Best fit | Reproducibility | Privacy guarantees |
|---|---|---|---|---|
| Mockaroo (native AI generation) | Schema-first; native AI generation built in (no GPT bolt-on needed); export to CSV/JSON/SQL | Tabular fixtures, moderate volume | Schema-based (good) | Synthetic from scratch (strong) |
| Tonic.ai | Production-data masking + Tonic Textual for unstructured; referentially intact relational data; QA-first use case | Teams synthesising from real production patterns | High (deterministic transforms) | Strong with proper configuration |
| NVIDIA SDG (formerly Gretel) | Gretel acquired by NVIDIA in 2025; brand absorbed into NVIDIA's synthetic-data tooling for agentic AI training | Teams already in NVIDIA's AI stack | Model-dependent | Differential-privacy options |
| SDV / DataCebo | Open-source sdv Python library (sdv.dev) for tabular, multi-table, and sequential synthesis; SDV Enterprise from DataCebo for scale | Data scientists with existing tabular pipelines; ING Belgium: 100× test coverage with SDV Enterprise | Model-versioned (strong) | Built-in privacy metrics |
| Hand-rolled LLM prompts | Direct calls to Claude or similar models with custom prompt patterns | Small datasets, unusual schemas, exploration | Requires seed + temperature discipline | Depends on prompt construction |
Synthetic test data tools, May 2026
Prompt patterns for reproducible LLM generation
Three patterns eliminate the most common reproducibility failures — applied specifically to test data generation.
Prompt-pattern fundamentals are covered in the prompt patterns for test authoring guide. The patterns below are the test-data-specific applications of that discipline — focusing on seed and temperature discipline, schema-constrained output, and batched generation.
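The seed and temperature constraint is the foundation. Setting temperature to 0 and fixing the seed makes the same input far more likely to produce the same output on every run, which is what fixture versioning and CI reruns depend on. It is not a hard guarantee: providers that expose a seed generally document it as best effort, some APIs do not expose a seed parameter at all, and a model update can change the output of an identical request. Treat these settings as a way to minimise drift, and treat the committed, versioned fixture as the source of truth. Without either, each generation run produces a different dataset and the fixture-pinning model breaks entirely.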
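A minimal sketch of the fixture-pinning side of this: the generated records are written to disk under a key derived from the prompt, model name, temperature, and seed, so reruns read the stored file instead of calling the model again. The function names, the `users-<key>.json` file naming, and the `generate` callable are hypothetical; `generate` stands in for whatever client call you actually use.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def fixture_key(prompt: str, model: str, temperature: float, seed: int) -> str:
    """Derive a short, stable key from everything that influences the generation."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature, "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_generate(
    fixture_dir: Path,
    prompt: str,
    model: str,
    temperature: float,
    seed: int,
    generate: Callable[[], list],
) -> list:
    """Return the pinned fixture if it exists; otherwise generate once and store it."""
    key = fixture_key(prompt, model, temperature, seed)
    path = fixture_dir / f"users-{key}.json"  # hypothetical naming scheme
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    records = generate()  # assumption: returns JSON-serialisable records
    fixture_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2, sort_keys=True), encoding="utf-8")
    return records
```

Changing the prompt, model, temperature, or seed changes the key, so a new fixture file is created alongside the old one and tests can keep pinning to the earlier version.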
Schema-constrained output prevents the most common failure mode: the LLM returning valid JSON that does not match your fixture schema. Passing the full JSON schema in the prompt, with an instruction to check the output against it before returning, heads off most schema drift at the source; the downstream validation step catches what remains. Mockaroo's creator has announced an AI agent for generating entire databases in seconds as the next-generation direction, but for hand-rolled prompts, explicit schema embedding remains the most reliable approach in 2026.
# Synthetic user generation — deterministic batch prompt
# API call: temperature=0, seed=42 (controls reproducibility)
You are generating synthetic test data for a UK e-commerce platform.
Produce exactly 20 user records as a JSON array matching this schema:
Schema:
{ "id": "uuid", "email": "string", "postcode": "string (UK format)",
"dob": "ISO 8601 date (1950–2005)", "tier": "bronze|silver|gold" }
Rules:
- IDs must be unique UUIDs: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
- Postcodes must be valid UK format (e.g. SW1A 1AA, M1 1AE)
- DOB must produce ages 18–75 as of 2026-01-01
- tier distribution: 60% bronze, 30% silver, 10% gold
Batch: 1 of 20 (IDs start at offset 0)
Return: JSON array only. No explanation or commentary.
The batching discipline
Smaller batches with seed offsets outperform single large requests — consistently.
The most common implementation mistake is requesting large volumes in a single LLM call. A request for 1,000 rows in one prompt produces 1,000 rows of degrading output: repetition appears after approximately 50 rows, column formats drift, and hallucinated values that violate the schema appear with increasing frequency towards the end of the response.
The correct approach is batched generation with explicit seed offsets. Twenty deterministic batches of 50 rows, each with a different seed offset, produce 1,000 rows with consistent quality across the entire dataset. Each batch is independently verifiable against the schema, and failed batches can be regenerated without discarding valid data.
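The batch arithmetic can be sketched as follows. The names `BatchRequest`, `plan_batches`, and `batch_line` are hypothetical, and deriving each batch's seed as base seed plus batch index is one simple convention, not the only option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchRequest:
    index: int      # 1-based batch number, as shown in the prompt
    total: int      # total number of batches in the plan
    seed: int       # per-batch seed: base seed plus a zero-based offset
    id_offset: int  # number of rows generated by earlier batches
    rows: int       # rows requested in this batch


def plan_batches(total_rows: int, rows_per_batch: int, base_seed: int) -> list[BatchRequest]:
    """Split a target row count into fixed-size batches with distinct seeds."""
    if total_rows <= 0 or rows_per_batch <= 0:
        raise ValueError("total_rows and rows_per_batch must be positive")
    total = -(-total_rows // rows_per_batch)  # ceiling division
    batches = []
    for i in range(total):
        offset = i * rows_per_batch
        rows = min(rows_per_batch, total_rows - offset)  # last batch may be short
        batches.append(
            BatchRequest(index=i + 1, total=total, seed=base_seed + i, id_offset=offset, rows=rows)
        )
    return batches


def batch_line(batch: BatchRequest) -> str:
    """Render the line that tells the model which slice of the dataset it is producing."""
    return f"Batch: {batch.index} of {batch.total} (IDs start at offset {batch.id_offset})"


plan = plan_batches(total_rows=1000, rows_per_batch=50, base_seed=42)
```

Because each `BatchRequest` carries its own seed and offset, a batch that fails validation can be regenerated on its own by replaying just that request.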
When SDV beats LLMs
LLMs excel at variety; SDV models distributions. For statistical realism in regression tests, use the right tool.
LLMs and dedicated synthetic-data tools solve different halves of the test-data problem. LLMs are good at variety — novel values, unusual combinations, adversarial edge cases. They are poor at preserving statistical properties: the distribution of values within a column, the correlations between columns, and the referential integrity between tables in a relational dataset.
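One way to see this weakness in practice is to measure how far a generated batch drifts from a requested distribution. The sketch below checks a single column against target proportions; the 60/30/10 tier split matches the example prompt earlier, while the function names and the 5-point tolerance are illustrative assumptions.

```python
from collections import Counter

# Target proportions for the "tier" column (from the example prompt's rules).
TARGET_TIERS = {"bronze": 0.60, "silver": 0.30, "gold": 0.10}


def tier_drift(records: list[dict], target: dict[str, float]) -> dict[str, float]:
    """Observed proportion minus target proportion, for each tier in the target."""
    if not records:
        raise ValueError("cannot measure drift on an empty dataset")
    counts = Counter(record["tier"] for record in records)
    total = len(records)
    return {tier: counts.get(tier, 0) / total - share for tier, share in target.items()}


def within_tolerance(records: list[dict], target: dict[str, float], tolerance: float = 0.05) -> bool:
    """True when every tier's observed share is within `tolerance` of its target."""
    return all(abs(delta) <= tolerance for delta in tier_drift(records, target).values())
```

A check like this covers only one marginal distribution. Correlations between columns and referential integrity across tables need purpose-built modelling, which is where a dedicated tool earns its place.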
SDV (sdv.dev), the open-source Python library maintained by DataCebo, models the statistical distribution of source data and generates new samples that preserve those distributions. For regression testing where fixtures need to statistically resemble production data — realistic age distributions, authentic postcode clustering, correlated purchase histories — SDV is the correct tool. For adversarial or edge-case inputs where statistical fidelity is irrelevant, hand-rolled LLM prompts win on speed and flexibility.
The practical guidance: use SDV for fixtures that anchor regression tests to realistic data; use LLM prompts for fixtures that stress edge cases and boundary conditions. The two approaches are complementary, not competitive.