Synthetic Data
// Definition
Artificially generated data that mimics the shape and statistical properties of real data without being real: fake names, plausible addresses, realistic-but-invented transactions. It lets teams test at volume and against edge cases without copying production data (and its privacy risk). Tools like Faker generate it; the harder version preserves real distributions for performance/ML testing.
// Why it matters
You can't safely test with real customer data (privacy law, breach risk) but you need realistic data to find real bugs. Synthetic data resolves the tension — realistic enough to exercise the system, fake enough to be safe. QA cares because the quality of synthetic data determines whether it surfaces real issues or just fills rows: garbage-but-valid data passes tests that real-shaped data would fail.
// How to test
// Generate realistic-but-fake data; verify it exercises real validation paths
import { faker } from '@faker-js/faker'
const user = {
  name: faker.person.fullName(),
  email: faker.internet.email(),
  postcode: faker.location.zipCode('??## #??'), // UK-shaped → tests real regex
}
cy.request('POST', '/api/users', user).its('status').should('eq', 201)
// edge cases at volume: generate 1000 varied records to surface boundary bugs
// Common mistakes
- Data that's valid but unrealistic (all same length, no edge cases) — misses real bugs
- Wrong locale/format (US zips when the app is UK) so validation isn't exercised
- Treating synthetic data as a complete substitute for real-shaped fixtures, instead of keeping some alongside it
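One way to avoid the first two mistakes is to generate records at volume from a fixed seed, deliberately mixing in awkward values. The sketch below is a minimal, dependency-free illustration rather than a recommended implementation: `mulberry32`, `ukPostcode`, `generateUsers`, and the name lists are hypothetical, and the hand-rolled generator stands in for Faker only so the example runs on its own (with Faker, `faker.seed()` gives similar reproducibility). The postcodes are UK-shaped, not guaranteed to be real postcodes.

```typescript
type User = { name: string; email: string; postcode: string };

// Small seeded PRNG (mulberry32) so a failing record can be regenerated exactly.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

const LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
// Illustrative values only: ordinary names plus deliberately awkward ones.
const ORDINARY_NAMES: readonly string[] = ['Alice Smith', 'Bob Jones', 'Priya Patel', 'Tom Li'];
const EDGE_NAMES: readonly string[] = ["O'Brien", 'Zoë Müller-Østergaard', 'X', 'A'.repeat(255), '李小龙'];

const pick = <T>(rand: () => number, items: readonly T[]): T =>
  items[Math.floor(rand() * items.length)];

// UK-shaped postcode: 1-2 letters, 1-2 digits, space, digit, 2 letters.
function ukPostcode(rand: () => number): string {
  const letter = () => LETTERS[Math.floor(rand() * LETTERS.length)];
  const digit = () => String(Math.floor(rand() * 10));
  const area = letter() + (rand() < 0.5 ? letter() : '');
  const district = digit() + (rand() < 0.5 ? digit() : '');
  return `${area}${district} ${digit()}${letter()}${letter()}`;
}

function generateUsers(count: number, seed: number): User[] {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, (_, i) => ({
    // Every tenth record gets an edge-case name so boundaries are always covered.
    name: i % 10 === 0 ? pick(rand, EDGE_NAMES) : pick(rand, ORDINARY_NAMES),
    email: `user${i}@example.test`,
    postcode: ukPostcode(rand),
  }));
}

const users = generateUsers(1000, 42);
```

Each record could then be posted to the endpoint under test. Because the seed is fixed, a record that triggers a failure can be regenerated by rerunning with the same seed.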
// Related terms
Seed Data
A known, controlled dataset loaded into a system before tests run, so every test starts from a predictable state. Seeding is what makes assertions reliable: if the database always begins with "User 1, 3 orders", a test can assert against those exact values instead of whatever happens to be there. The opposite of testing against a shared, drifting environment.
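As a rough sketch of the idea, the snippet below reseeds an in-memory store before each test so assertions can rely on exact values. Everything here is an assumption for illustration: the `Db` shape, `SEED`, `seedDatabase`, and `ordersFor` are hypothetical names, and a real suite would reset an actual database or call a seeding endpoint instead.

```typescript
type Order = { id: number; userId: number; total: number };
type Db = { users: { id: number; name: string }[]; orders: Order[] };

// The known starting state: "User 1, 3 orders".
const SEED: Db = {
  users: [{ id: 1, name: 'User 1' }],
  orders: [
    { id: 1, userId: 1, total: 10 },
    { id: 2, userId: 1, total: 25 },
    { id: 3, userId: 1, total: 40 },
  ],
};

// Return a fresh deep copy so no test can mutate the seed itself.
function seedDatabase(): Db {
  return JSON.parse(JSON.stringify(SEED)) as Db;
}

function ordersFor(db: Db, userId: number): Order[] {
  return db.orders.filter((o) => o.userId === userId);
}

// First test: starts from the seed, then mutates state.
let db = seedDatabase();
db.orders.push({ id: 4, userId: 1, total: 5 });
const countAfterMutation = ordersFor(db, 1).length; // 4

// Next test: reseeds, so the earlier mutation does not leak in.
db = seedDatabase();
const countAfterReseed = ordersFor(db, 1).length; // 3
```

In a Cypress or Jest suite the `seedDatabase()` call would typically live in a `beforeEach` hook.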
Data Masking
Replacing sensitive fields in a real dataset with realistic but fake values, so that a copy of production can be used for testing without exposing actual PII. Names become other names, card numbers become valid-format fakes, and emails get scrambled, but the data keeps its shape, relationships, and distribution. The middle path between "test on raw production" (a privacy and compliance risk, and often unlawful) and "test on pure synthetic" (less realistic).
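A minimal sketch of deterministic masking, under stated assumptions: the same input always maps to the same fake value, so relationships between tables survive, and masked card numbers keep their length and pass a Luhn check. The function names (`fnv1a`, `maskEmail`, `maskCard`, `isLuhnValid`) and the `SECRET` value are hypothetical. FNV-1a is used only to keep the example self-contained; it is not cryptographic, and a real masking pipeline would more likely use a keyed hash such as HMAC.

```typescript
// FNV-1a 32-bit hash: non-cryptographic, used here only for stable pseudonyms.
function fnv1a(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Same email (case-insensitive) always yields the same masked address.
function maskEmail(email: string, secret: string): string {
  return `user_${fnv1a(secret + email.toLowerCase()).toString(16)}@example.test`;
}

// Luhn check digit for a payload (all digits except the last).
function luhnCheckDigit(payload: number[]): number {
  let sum = 0;
  for (let i = 0; i < payload.length; i++) {
    let d = payload[payload.length - 1 - i];
    if (i % 2 === 0) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return (10 - (sum % 10)) % 10;
}

// Replace every digit with a hash-derived one, then append a valid check digit.
// Returns digits only; separators in the input are dropped.
function maskCard(card: string, secret: string): string {
  const digits = card.replace(/\D/g, '');
  let h = fnv1a(secret + digits);
  const body: number[] = [];
  for (let i = 0; i < digits.length - 1; i++) {
    h = fnv1a(String(h) + i);
    body.push(h % 10);
  }
  return body.join('') + luhnCheckDigit(body);
}

function isLuhnValid(num: string): boolean {
  let sum = 0;
  for (let i = 0; i < num.length; i++) {
    let d = Number(num[num.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

const SECRET = 'per-environment-secret'; // assumption: kept out of source control
const masked = {
  email: maskEmail('Jane.Doe@corp.example', SECRET),
  card: maskCard('4111 1111 1111 1111', SECRET),
};
```

Determinism is the design choice that matters here: a customer's masked email is identical in every table, so joins and foreign keys still line up after masking.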
Test Data Management
Provisioning, masking, refreshing, and tearing down data needed by tests. Done well, it's invisible. Done badly, it's the reason a third of tests fail on Mondays.