Synthetic Data

Test Data · intermediate

// Definition

Artificially generated data that mimics the shape and statistical properties of real data without being real: fake names, plausible addresses, realistic-but-invented transactions. It lets teams test at volume and cover edge cases without copying production data and inheriting its privacy risk. Libraries like Faker generate the basic version; the harder version preserves real distributions for performance and ML testing.
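The "harder version" mentioned above, data that preserves a real distribution, can be sketched as weighted sampling. This is a minimal illustration under stated assumptions: the 70/25/5 order-size mix is hypothetical rather than measured from any real system, and `mulberry32` and `pickWeighted` are illustrative helpers, not part of Faker.

```typescript
// Small seeded PRNG (mulberry32-style) so the generated data is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type Weighted<T> = { value: T; weight: number };

// Picks one value with probability proportional to its weight.
function pickWeighted<T>(items: Weighted<T>[], rand: () => number): T {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let threshold = rand() * total;
  for (const item of items) {
    threshold -= item.weight;
    if (threshold < 0) return item.value;
  }
  return items[items.length - 1].value; // guard against floating-point drift
}

// Hypothetical production mix: mostly small orders, a thin tail of large ones.
const orderSizeMix: Weighted<string>[] = [
  { value: 'small', weight: 70 },
  { value: 'medium', weight: 25 },
  { value: 'large', weight: 5 },
];

const rand = mulberry32(42);
const sample = Array.from({ length: 10000 }, () => pickWeighted(orderSizeMix, rand));

// Tally the sample so the generated proportions can be compared with the target mix.
const counts: Record<string, number> = {};
for (const size of sample) counts[size] = (counts[size] || 0) + 1;
console.log(counts);
```

Seeding the generator means a failing run can be reproduced exactly, which matters when the synthetic data itself is what triggered the bug.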

// Why it matters

You can't safely test with real customer data (privacy law, breach risk), but you need realistic data to find real bugs. Synthetic data resolves the tension: realistic enough to exercise the system, fake enough to be safe. QA cares because the quality of synthetic data determines whether it surfaces real issues or just fills rows. Garbage-but-valid data passes tests that real-shaped data would fail.

// How to test

// Generate realistic-but-fake data; verify it exercises real validation paths
import { faker } from '@faker-js/faker'

// Cypress commands must run inside a test, so the request lives in an it() block
it('creates a user from synthetic data', () => {
  const user = {
    name: faker.person.fullName(),
    email: faker.internet.email(),
    // '?' becomes a letter and '#' a digit: UK-shaped (AA99 9AA layout) → tests real regex
    postcode: faker.location.zipCode('??## #??'),
  }
  cy.request('POST', '/api/users', user).its('status').should('eq', 201)
})
// Edge cases at volume: generate 1000 varied records to surface boundary bugs
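The volume comment above can be sketched as follows. To stay runnable on its own, this version uses no libraries and cycles through small hand-picked pools of boundary values; in practice Faker could supply the typical values alongside them. The `UserRecord` shape, the pool contents, the 255-character limit, and the `example.test` domain are all assumptions made for illustration.

```typescript
// Hypothetical record shape; field names mirror the user object in the example above.
type UserRecord = { fullName: string; email: string; postcode: string };

// Boundary-heavy pools: very short, very long, punctuated, and non-ASCII values.
const names = [
  'Li',
  "Siobhán O'Connor-Smythe",
  'Nguyễn Thị Minh Khai',
  'Jean-Luc de la Fontaine',
  'A'.repeat(255), // assumed maximum column length
];
const localParts = ['a', 'first.last', 'user+tag', 'x'.repeat(64)]; // 64 = longest valid local part
// One postcode per UK layout: A9 9AA, A9A 9AA, A99 9AA, AA9 9AA, AA99 9AA, AA9A 9AA.
const postcodes = ['M1 1AE', 'W1A 0AX', 'B33 8TH', 'CR2 6XH', 'DN55 1PT', 'EC1A 1BB'];

function buildRecords(count: number): UserRecord[] {
  return Array.from({ length: count }, (_, i) => ({
    fullName: names[i % names.length],
    // The index goes into the domain, so every address is unique and
    // uniqueness constraints get exercised without lengthening the local part.
    email: `${localParts[i % localParts.length]}@u${i}.example.test`,
    postcode: postcodes[i % postcodes.length],
  }));
}

const records = buildRecords(1000);
console.log(records.length, records[0]);
```

Each record could then be posted with the same `cy.request` call used above, asserting on whatever status the API is expected to return for that input.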

// Common mistakes

  • Data that's valid but unrealistic (every value the same length, no edge cases), which misses real bugs
  • Wrong locale or format (US ZIP codes when the app is UK), so the real validation isn't exercised
  • Treating synthetic data as a complete substitute for real-shaped fixtures instead of keeping some alongside it
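The first mistake in the list (valid but uniform data) can be caught with a quick variety check before the data is used. The sketch below is one possible heuristic, not an established metric: `fieldVariety` and the properties it reports are hypothetical and would need tuning per field.

```typescript
// Hypothetical helper: summarizes how varied a generated string field is.
function fieldVariety(values: string[]) {
  const lengths = values.map((v) => v.length);
  return {
    minLength: Math.min(...lengths),
    maxLength: Math.max(...lengths),
    distinctLengths: new Set(lengths).size,
    hasNonAscii: values.some((v) => /[^\x00-\x7F]/.test(v)),
    hasPunctuation: values.some((v) => /['\-.]/.test(v)),
  };
}

// Valid but unrealistic: every name has the same length and plain ASCII letters.
const uniform = ['Anna Bell', 'John Cole', 'Mary Dunn'];
// Closer to real-shaped: short, long, accented, apostrophe, hyphen.
const varied = ['Li', "Siobhán O'Connor-Smythe", 'Jean-Luc Picard'];

console.log(fieldVariety(uniform));
console.log(fieldVariety(varied));
```

A dataset that reports a single distinct length and no punctuation or non-ASCII characters is a signal that it is filling rows rather than exercising validation.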

// Related terms