Synthetic test data with LLMs

11 min read · Reviewed May 2026 · generationscore: pass — synthetic data tooling has stabilised; reviewed quarterly

Production data is sensitive, manually crafted data misses edge cases, and LLM-generated data is "easy" but rarely reproducible without discipline. The interesting question is not whether to generate synthetic data — it is which tool fits which use case and how to keep generations deterministic. Four approaches have carved out defensible niches: schema-first generators like Mockaroo with its native AI generation, production-data synthesisers like Tonic.ai, statistical modellers like the SDV library from DataCebo (sdv.dev), and hand-rolled LLM prompts. Each has a specific sweet spot, and knowing which to reach for before you start is the difference between a one-hour win and a three-day debugging exercise.

DIFFICULTY: intermediate
YOU'LL LEARN: How LLMs and dedicated synthetic-data tools compare for generating realistic test fixtures, and the prompt patterns that produce reproducible output.

The generation flow

Four stages from prompt to versioned fixture — reproducibility is designed in, not added afterwards.

A reliable synthetic-data generation pipeline is not a single LLM call. It is four stages: a prompt that carries the schema and a seed value, an LLM or dedicated generator that produces structured output, a schema-validation step that rejects non-conforming output before it reaches the test suite, and a versioning step that tags the output so tests can pin to a known-good fixture set.
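The four stages can be wired together roughly as follows. This is a minimal sketch rather than a reference implementation: `run_pipeline`, `Fixture`, `fake_generator`, and `require_tier` are hypothetical names, the generator is a stub standing in for whichever LLM or tool is used, and the content-hash version tag is one possible versioning scheme among several.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable

Record = dict[str, str]


@dataclass(frozen=True)
class Fixture:
    version: str           # content-derived tag that tests can pin to
    records: list[Record]


def run_pipeline(
    prompt: str,
    seed: int,
    generate: Callable[[str, int], str],
    validate: Callable[[Record], list[str]],
) -> Fixture:
    """Prompt + seed -> generator -> schema validation -> versioned fixture."""
    raw = generate(prompt, seed)                  # stage 2: structured output
    records = json.loads(raw)
    for index, record in enumerate(records):      # stage 3: reject bad rows early
        errors = validate(record)
        if errors:
            raise ValueError(f"record {index} rejected: {errors}")
    # Stage 4: hash the canonical JSON so identical data always gets the same tag.
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return Fixture(version=f"seed{seed}-{digest[:12]}", records=records)


def fake_generator(prompt: str, seed: int) -> str:
    """Hypothetical stub standing in for an LLM call."""
    return json.dumps([{"email": f"user{seed}@example.test", "tier": "bronze"}])


def require_tier(record: Record) -> list[str]:
    """Toy validator: only checks the tier enum."""
    ok = record.get("tier") in {"bronze", "silver", "gold"}
    return [] if ok else [f"invalid tier: {record.get('tier')!r}"]


fixture = run_pipeline("generate users", 42, fake_generator, require_tier)
```

Because the version tag is derived from the data itself, a test that pins to a tag fails loudly if a regenerated fixture differs by even one value.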

The validation step is the most commonly skipped and the most consequential to skip. Without it, schema drift between the LLM output and the test fixture expectation accumulates silently — the generated data looks plausible, the test imports it, and the assertion fails three layers deep with an error message that points nowhere useful.
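A validation step of this kind might look like the sketch below, written against the five-field user schema used later in this guide. The function names and regular expressions are illustrative assumptions: the patterns check shape only, and a project with an existing JSON Schema validator would likely use that instead.

```python
import re
import uuid
from datetime import date

TIERS = {"bronze", "silver", "gold"}
FIELDS = {"id", "email", "postcode", "dob", "tier"}
AS_OF = date(2026, 1, 1)  # reference date for the age rule
# Shape checks only: these do not prove the postcode or mailbox exists.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def age_on(dob: date, as_of: date) -> int:
    """Whole years between dob and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


def validate_user(record: dict) -> list[str]:
    """Return a list of problems for one record; an empty list means valid."""
    if set(record) != FIELDS:
        return [f"fields differ from schema: {sorted(set(record) ^ FIELDS)}"]
    errors = []
    try:
        uuid.UUID(str(record["id"]))
    except ValueError:
        errors.append(f"id is not a UUID: {record['id']!r}")
    if not EMAIL_RE.match(str(record["email"])):
        errors.append(f"email is malformed: {record['email']!r}")
    if not POSTCODE_RE.match(str(record["postcode"])):
        errors.append(f"postcode is not UK format: {record['postcode']!r}")
    try:
        age = age_on(date.fromisoformat(str(record["dob"])), AS_OF)
        if not 18 <= age <= 75:
            errors.append(f"age {age} outside 18-75 on {AS_OF.isoformat()}")
    except ValueError:
        errors.append(f"dob is not an ISO 8601 date: {record['dob']!r}")
    if str(record["tier"]) not in TIERS:
        errors.append(f"tier is not one of {sorted(TIERS)}: {record['tier']!r}")
    return errors


def validate_batch(records: list[dict]) -> dict[int, list[str]]:
    """Map row index to its problems, including duplicate IDs across the batch."""
    problems: dict[int, list[str]] = {}
    seen_ids: set[str] = set()
    for index, record in enumerate(records):
        errors = validate_user(record)
        record_id = str(record.get("id"))
        if record_id in seen_ids:
            errors.append(f"duplicate id: {record_id!r}")
        seen_ids.add(record_id)
        if errors:
            problems[index] = errors
    return problems
```

Returning the row index alongside each message is the design choice that matters: the failure surfaces at generation time with a pointer to the offending record, instead of three layers deep in a test.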

Flow diagram: Prompt + schema (+ seed value) → LLM or generator (deterministic) → Schema validation (type + constrai…) → Test fixture (versioned)
Synthetic data generation pipeline

Vendor landscape, May 2026

Schema-first, production-synthesised, statistical, and hand-rolled — four approaches with distinct sweet spots.

The table below covers the four main approaches to synthetic test data generation in 2026, plus NVIDIA's synthetic data tooling, which absorbed Gretel. This is a practical comparison of the approaches QA practitioners are most likely to encounter when setting up a test data pipeline, not a comprehensive market survey.

Gretel, formerly a standalone vendor, was acquired by NVIDIA and is now part of NVIDIA's synthetic data tooling. The SDV open-source library (sdv.dev) is maintained by DataCebo, which publishes SDV Enterprise for teams requiring scale and support; a DataCebo case study documents ING Belgium achieving 100× test coverage using SDV Enterprise.

Approach | Best fit | Reproducibility | Privacy guarantees
Mockaroo (native AI generation): Schema-first; native AI generation built in (no GPT bolt-on needed); export to CSV/JSON/SQL | Tabular fixtures, moderate volume | Schema-based (good) | Synthetic from scratch (strong)
Tonic.ai: Production-data masking + Tonic Textual for unstructured; referentially intact relational data; QA-first use case | Teams synthesising from real production patterns | High (deterministic transforms) | Strong with proper configuration
NVIDIA SDG (formerly Gretel): Gretel acquired by NVIDIA in 2025; brand absorbed into NVIDIA's synthetic-data tooling for agentic AI training | Teams already in NVIDIA's AI stack | Model-dependent | Differential-privacy options
SDV / DataCebo: Open-source sdv Python library (sdv.dev) for tabular, multi-table, and sequential synthesis; SDV Enterprise from DataCebo for scale | Data scientists with existing tabular pipelines; ING Belgium: 100× test coverage with SDV Enterprise | Model-versioned (strong) | Built-in privacy metrics
Hand-rolled LLM prompts: Direct calls to Claude or similar models with custom prompt patterns | Small datasets, unusual schemas, exploration | Requires seed + temperature discipline | Depends on prompt construction

Synthetic test data tools, May 2026

Prompt patterns for reproducible LLM generation

Three patterns eliminate the most common reproducibility failures — applied specifically to test data generation.

Prompt-pattern fundamentals are covered in the prompt patterns for test authoring guide. The patterns below are the test-data-specific applications of that discipline — focusing on seed and temperature discipline, schema-constrained output, and batched generation.

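The seed and temperature constraint is the foundation. Setting temperature to 0 and passing a fixed seed value, where the API exposes one, makes the same input far more likely to produce the same output on every run, which fixture versioning and CI reruns depend on. Most hosted models treat this as best-effort rather than guaranteed determinism, and some APIs have no seed parameter at all, so commit the validated output as the pinned fixture instead of regenerating it on every run. Without this discipline, each generation run produces a different dataset and the fixture-pinning model breaks entirely.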
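One way to check that the discipline holds in practice is to fingerprint each run's output and compare digests. The sketch below assumes a hypothetical `call_model` wrapper around whichever API is in use; the parameter names in `SAMPLING` are assumptions too, since providers differ in what they accept.

```python
import hashlib
import json
from typing import Callable

# Sampling settings pinned in one place. The names are assumptions:
# check which of them your provider's API actually accepts.
SAMPLING = {"temperature": 0, "seed": 42}


def fingerprint(raw_json: str) -> str:
    """SHA-256 of the canonicalised JSON, so key order and whitespace don't matter."""
    canonical = json.dumps(json.loads(raw_json), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(
    call_model: Callable[[str, dict], str], prompt: str, runs: int = 3
) -> bool:
    """Call the model several times with identical settings and compare digests."""
    digests = {fingerprint(call_model(prompt, SAMPLING)) for _ in range(runs)}
    return len(digests) == 1


def stub_model(prompt: str, sampling: dict) -> str:
    """Hypothetical stand-in for a real API call; deterministic by construction."""
    return json.dumps([{"prompt_length": len(prompt), "seed": sampling["seed"]}])
```

Running `is_reproducible` once against the real model before pinning a fixture set gives an early signal about whether reruns can be trusted or whether the stored output has to be treated as the source of truth.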

Schema-constrained output prevents the most common failure mode: the LLM returning valid JSON that does not match your fixture schema. Passing the full JSON schema in the prompt, with an instruction to check the output against it before returning, removes the majority of schema-drift issues; the pipeline's validation step catches the rest. Mockaroo's creator has announced an AI agent for generating entire databases in seconds as the next-generation direction, but for hand-rolled prompts, explicit schema embedding remains the most reliable approach in 2026.

# Synthetic user generation — deterministic batch prompt
# API call: temperature=0, seed=42 (controls reproducibility)

You are generating synthetic test data for a UK e-commerce platform.
Produce exactly 20 user records as a JSON array matching this schema:

Schema:
{ "id": "uuid", "email": "string", "postcode": "string (UK format)",
  "dob": "ISO 8601 date (1950–2008)", "tier": "bronze|silver|gold" }

Rules:
- IDs must be unique UUIDs: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
- Postcodes must be valid UK format (e.g. SW1A 1AA, M1 1AE)
- DOB must fall between 1950-01-02 and 2008-01-01, giving ages 18–75 as of 2026-01-01
- tier distribution: 60% bronze, 30% silver, 10% gold

Batch: 1 of 20 (IDs start at offset 0)
Return: JSON array only. No explanation or commentary.
Prompt pattern for reproducible test data — seed, schema, batch (set temperature: 0 and seed at the API call level)
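A prompt like the one above can be assembled in code so that the schema text and the batch number never drift apart by hand-editing. The sketch below is one possible shape: `build_prompt`, `USER_SCHEMA`, and `RULES` are hypothetical names, and the schema is the informal field description used in the prompt, not a formal JSON Schema document.

```python
import json

USER_SCHEMA = {
    "id": "uuid",
    "email": "string",
    "postcode": "string (UK format)",
    "dob": "ISO 8601 date",
    "tier": "bronze|silver|gold",
}

RULES = [
    "IDs must be unique UUIDs",
    "Postcodes must be valid UK format (e.g. SW1A 1AA, M1 1AE)",
    "tier distribution: 60% bronze, 30% silver, 10% gold",
]


def build_prompt(
    schema: dict, rules: list[str], rows: int, batch: int, total_batches: int
) -> str:
    """Render a batch prompt; identical inputs always give identical text."""
    if not 1 <= batch <= total_batches:
        raise ValueError(f"batch {batch} is outside 1..{total_batches}")
    lines = [
        "You are generating synthetic test data for a UK e-commerce platform.",
        f"Produce exactly {rows} user records as a JSON array matching this schema:",
        "",
        "Schema:",
        # sort_keys keeps the rendered schema stable across runs
        json.dumps(schema, indent=2, sort_keys=True),
        "",
        "Rules:",
        *[f"- {rule}" for rule in rules],
        "",
        f"Batch: {batch} of {total_batches}",
        "Return: JSON array only. No explanation or commentary.",
    ]
    return "\n".join(lines)


prompt = build_prompt(USER_SCHEMA, RULES, rows=20, batch=1, total_batches=20)
```

Keeping the prompt text byte-stable matters as much as the sampling settings: a reordered schema key is a different input and can produce a different dataset.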

The batching discipline

Smaller batches with seed offsets outperform single large requests — consistently.

The most common implementation mistake is requesting large volumes in a single LLM call. A request for 1,000 rows in one prompt produces 1,000 rows of degrading output: repetition appears after approximately 50 rows, column formats drift, and hallucinated values that violate the schema appear with increasing frequency towards the end of the response.

The correct approach is batched generation with explicit seed offsets. Twenty deterministic batches of 50 rows, each with a different seed offset, produce 1,000 rows with consistent quality across the entire dataset. Each batch is independently verifiable against the schema, and failed batches can be regenerated without discarding valid data.
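The batching scheme can be sketched as a plan of batches, each carrying its own seed, followed by a runner that keeps valid batches and reports the rest. The names `Batch`, `plan_batches`, and `run_batches` are hypothetical, and `generate_batch` stands in for the actual model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Batch:
    index: int       # 0-based position in the run
    seed: int        # base seed plus offset: distinct per batch, stable across reruns
    row_offset: int  # index of this batch's first row in the full dataset
    rows: int        # number of rows to request


def plan_batches(total_rows: int, batch_size: int = 50, base_seed: int = 42) -> list[Batch]:
    """Split a target row count into fixed-size batches with per-batch seeds."""
    if total_rows <= 0 or batch_size <= 0:
        raise ValueError("total_rows and batch_size must be positive")
    plan = []
    for index, start in enumerate(range(0, total_rows, batch_size)):
        rows = min(batch_size, total_rows - start)  # last batch may be short
        plan.append(Batch(index=index, seed=base_seed + index, row_offset=start, rows=rows))
    return plan


def run_batches(
    plan: list[Batch],
    generate_batch: Callable[[Batch], list[dict]],
    is_valid: Callable[[dict], bool],
) -> tuple[dict[int, list[dict]], list[int]]:
    """Keep batches that pass validation; report the indices of those that fail."""
    accepted: dict[int, list[dict]] = {}
    failed: list[int] = []
    for batch in plan:
        rows = generate_batch(batch)
        if len(rows) == batch.rows and all(is_valid(row) for row in rows):
            accepted[batch.index] = rows
        else:
            failed.append(batch.index)
    return accepted, failed
```

With the defaults, `plan_batches(1000)` yields 20 batches of 50 rows with seeds 42 through 61. Only the indices returned in `failed` need another generation pass; the accepted batches stay untouched.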

// WARNING

Don't ask an LLM for 1,000 rows in one prompt. Output degrades after approximately 50 rows: repetition, hallucinated columns, format drift. Batch with seed offsets: 20 deterministic batches of 50 rows beat one 1,000-row request every time, and give you reproducible reruns per batch.

When SDV beats LLMs

LLMs excel at variety; SDV models distributions. For statistical realism in regression tests, use the right tool.

LLMs and dedicated synthetic-data tools solve different halves of the test-data problem. LLMs are good at variety — novel values, unusual combinations, adversarial edge cases. They are poor at preserving statistical properties: the distribution of values within a column, the correlations between columns, and the referential integrity between tables in a relational dataset.

SDV (sdv.dev), the open-source Python library maintained by DataCebo, models the statistical distribution of source data and generates new samples that preserve those distributions. For regression testing where fixtures need to statistically resemble production data — realistic age distributions, authentic postcode clustering, correlated purchase histories — SDV is the correct tool. For adversarial or edge-case inputs where statistical fidelity is irrelevant, hand-rolled LLM prompts win on speed and flexibility.
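Whichever tool produces the fixtures, "statistically resembles production" can be turned into a number. The sketch below is not SDV's API; it is a standard-library illustration, under the assumption that a single categorical column is being compared, using total variation distance between the source and synthetic proportions. The function names and the example tier data are illustrative.

```python
from collections import Counter


def proportions(values: list[str]) -> dict[str, float]:
    """Share of each category in a column."""
    if not values:
        raise ValueError("cannot compute proportions of an empty column")
    total = len(values)
    return {category: count / total for category, count in Counter(values).items()}


def total_variation_distance(source: list[str], synthetic: list[str]) -> float:
    """0.0 means identical category shares; 1.0 means no overlap at all."""
    p, q = proportions(source), proportions(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Illustrative data: a 60/30/10 tier split in the source column.
source_tiers = ["bronze"] * 60 + ["silver"] * 30 + ["gold"] * 10
faithful = ["bronze"] * 12 + ["silver"] * 6 + ["gold"] * 2
skewed = ["gold"] * 20

faithful_gap = total_variation_distance(source_tiers, faithful)
skewed_gap = total_variation_distance(source_tiers, skewed)
```

A regression suite could assert that the gap stays under a chosen threshold per column. This only covers single-column shares; correlations between columns and referential integrity need their own checks, which is the part dedicated synthesisers are built to handle.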

The practical guidance: use SDV for fixtures that anchor regression tests to realistic data; use LLM prompts for fixtures that stress edge cases and boundary conditions. The two approaches are complementary, not competitive.
