Test Data

Creating, managing and protecting the data tests run against — fixtures, seeds, synthetic data and masking.

8 terms

C

A plain-text file format in which each line represents a record and fields are separated by commas (or, in common variants, other delimiters such as tabs or semicolons). The first row is commonly a header row naming each column. CSV is one of the most common formats for test data files, bulk imports, and data-driven test suites: a QA engineer creates one row per test case, feeds the file to a test runner, and the runner executes each row as a separate test. Common failure modes include unquoted fields containing commas, inconsistent column counts between rows, trailing blank lines, non-UTF-8 encoding, and duplicate headers. Validate a CSV structurally before using it as test input: a malformed header row can silently shift values into the wrong columns.
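The structural checks described above can be sketched roughly as follows. This is a minimal illustration, not a full CSV parser: `splitCsvLine` and `validateCsv` are hypothetical helper names, and the sketch assumes comma delimiters, double-quote quoting, and no line breaks inside quoted fields.

```typescript
/** Split one CSV line into fields, honoring double-quoted fields and "" escapes. */
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current); // field boundary
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

/** Return a list of structural problems; an empty list means no problem was found. */
function validateCsv(text: string): string[] {
  // Ignore blank lines (including the one produced by a trailing newline).
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return ["file is empty"];

  const problems: string[] = [];
  const header = splitCsvLine(lines[0] ?? "");
  if (new Set(header).size !== header.length) {
    problems.push("duplicate header names");
  }
  // Every data row must have exactly as many fields as the header.
  lines.slice(1).forEach((line, idx) => {
    const count = splitCsvLine(line).length;
    if (count !== header.length) {
      // Row numbers count non-blank lines, with the header as row 1.
      problems.push(`row ${idx + 2}: expected ${header.length} fields, got ${count}`);
    }
  });
  return problems;
}
```

Running a check like this before the test runner reads the file turns a silent column shift into an explicit error message. Encoding problems are out of scope here and would need a separate check on the raw bytes.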

D

Replacing sensitive fields in a real dataset with realistic but fake values, so that a copy of production can be used for testing without exposing actual PII (personally identifiable information). Names become other names, card numbers become valid-format fakes, and emails are replaced with invented addresses, but the data keeps its shape, relationships, and distribution. The middle path between "test on raw production" (risky, and often a breach of privacy regulations) and "test on pure synthetic" (less realistic).
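One way to keep relationships intact while masking is to make the replacement deterministic, so the same real value always maps to the same fake one. The sketch below illustrates that idea under stated assumptions: the `Customer` shape, the `FAKE_NAMES` list, and all function names are hypothetical, and the FNV-1a hash is used only for repeatability. It is not cryptographic, so a real masking pipeline would need a keyed or otherwise irreversible scheme.

```typescript
interface Customer {
  id: number;
  name: string;
  email: string;
  cardNumber: string;
}

// Hypothetical pool of replacement names.
const FAKE_NAMES = ["Alex Morgan", "Sam Patel", "Jordan Lee", "Casey Kim"];

/** FNV-1a 32-bit hash: deterministic, but NOT secure against reversal. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

/** Luhn check digit for a digit string that does not yet include the check digit. */
function luhnCheckDigit(payload: string): number {
  let sum = 0;
  for (let i = 0; i < payload.length; i++) {
    let d = Number(payload.charAt(payload.length - 1 - i));
    // Counting from the right of the payload, every even position is doubled.
    if (i % 2 === 0) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return (10 - (sum % 10)) % 10;
}

/** Replace a card number with a fake of the same digit count that passes the Luhn check. */
function maskCard(real: string): string {
  const digitsOnly = real.replace(/\D/g, "");
  if (digitsOnly.length === 0) return "";
  let payload = "";
  let state = fnv1a(digitsOnly);
  for (let i = 0; i < digitsOnly.length - 1; i++) {
    state = fnv1a(`${state}:${i}`); // derive the next pseudo-random digit
    payload += String(state % 10);
  }
  return payload + String(luhnCheckDigit(payload));
}

function maskCustomer(c: Customer): Customer {
  return {
    id: c.id, // keys are kept so joins to other tables still work
    name: FAKE_NAMES[fnv1a(c.name) % FAKE_NAMES.length] ?? "Test User",
    email: `user${fnv1a(c.email.toLowerCase())}@example.com`,
    cardNumber: maskCard(c.cardNumber),
  };
}
```

Because the mapping is a pure function of the input, masking the same customer in two different tables yields matching values, which is what preserves relationships across the masked copy.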

Running the same test logic against many input/output combinations, typically loaded from a CSV, JSON file, or database. Separates test data from test code so you can scale coverage without duplicating logic.
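As a rough sketch of the pattern, the example below runs one piece of test logic over a table of cases. The function under test (`shippingCost`) and its pricing bands are invented for illustration, and the cases are inlined rather than loaded from a CSV, JSON file, or database to keep the example self-contained.

```typescript
// Hypothetical function under test: flat shipping bands by weight.
function shippingCost(weightKg: number): number {
  if (weightKg <= 0) throw new RangeError("weight must be positive");
  if (weightKg <= 1) return 5;
  if (weightKg <= 10) return 12;
  return 30;
}

interface ShippingCase {
  name: string;
  weightKg: number;
  expected: number;
}

// The test data: one entry per case. In practice this would be parsed from a file.
const shippingCases: ShippingCase[] = [
  { name: "light parcel", weightKg: 0.5, expected: 5 },
  { name: "boundary at 1 kg", weightKg: 1, expected: 5 },
  { name: "just over 1 kg", weightKg: 1.01, expected: 12 },
  { name: "boundary at 10 kg", weightKg: 10, expected: 12 },
  { name: "heavy parcel", weightKg: 25, expected: 30 },
];

// The test logic: written once, executed for every case.
function runShippingCases(cases: ShippingCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = shippingCost(c.weightKg);
    if (actual !== c.expected) {
      failures.push(`${c.name}: expected ${c.expected}, got ${actual}`);
    }
  }
  return failures;
}
```

Adding coverage then means adding a row, not a new test function. Test runners often provide this directly; Jest's `test.each` is one example.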

S

A known, controlled dataset loaded into a system before tests run, so every test starts from a predictable state. Seeding is what makes assertions reliable: if the database always begins with "User 1, 3 orders", a test can assert against those exact values instead of whatever happens to be there. The opposite of testing against a shared, drifting environment.
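A minimal sketch of the idea, using an in-memory store as a stand-in for a real database. The `InMemoryDb` class, the `seed` function, and the order IDs and totals are all assumptions made for illustration; a real project would typically run SQL or a migration tool's seed step instead.

```typescript
interface User {
  id: number;
  name: string;
}

interface Order {
  id: number;
  userId: number;
  totalCents: number;
}

// Stand-in for a real database connection.
class InMemoryDb {
  users = new Map<number, User>();
  orders = new Map<number, Order>();

  reset(): void {
    this.users.clear();
    this.orders.clear();
  }
}

/** Wipe the store and load the known baseline: "User 1" with 3 orders. */
function seed(db: InMemoryDb): void {
  db.reset();
  db.users.set(1, { id: 1, name: "User 1" });
  const totals = [1999, 4500, 250];
  totals.forEach((totalCents, i) => {
    const id = 101 + i;
    db.orders.set(id, { id, userId: 1, totalCents });
  });
}

function orderCountFor(db: InMemoryDb, userId: number): number {
  let count = 0;
  for (const order of db.orders.values()) {
    if (order.userId === userId) count++;
  }
  return count;
}
```

Calling `seed` before every test means a test that deletes or adds orders cannot affect the next one: each test can assert that User 1 has exactly 3 orders.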

Artificially generated data that mimics the shape and statistical properties of real data without being real: fake names, plausible addresses, realistic-but-invented transactions. It lets teams test at volume and exercise edge cases without copying production data (and its privacy risk). Tools like Faker generate it; the harder version preserves real-world distributions for performance and ML testing.
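The sketch below shows the simple end of this spectrum: a seeded generator that produces invented transactions, so the same seed always yields the same dataset. It uses a small hand-rolled PRNG in the style of mulberry32 instead of a library such as Faker, purely to stay dependency-free. The `Txn` shape, the name pool, and the edge-case amounts are assumptions, and no attempt is made to match real-world distributions.

```typescript
interface Txn {
  id: number;
  customer: string;
  amountCents: number;
  currency: string;
}

/** Small seeded PRNG (mulberry32-style); returns values in [0, 1). */
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical value pools; none of these refer to real people.
const CUSTOMER_NAMES = ["Avery Chen", "Riley Okafor", "Morgan Silva", "Quinn Novak"];
const CURRENCIES = ["USD", "EUR", "GBP"];
// Deliberate edge cases: zero, the smallest unit, and a very large amount.
const EDGE_AMOUNTS = [0, 1, 99999999];

function pick<T>(rng: () => number, items: readonly T[]): T {
  return items[Math.floor(rng() * items.length)] as T;
}

function generateTransactions(count: number, seed: number): Txn[] {
  const rng = seededRandom(seed);
  const txns: Txn[] = [];
  for (let i = 0; i < count; i++) {
    // Roughly one in ten transactions uses an edge-case amount.
    const amountCents =
      rng() < 0.1 ? pick(rng, EDGE_AMOUNTS) : 100 + Math.floor(rng() * 49900);
    txns.push({
      id: i + 1,
      customer: pick(rng, CUSTOMER_NAMES),
      amountCents,
      currency: pick(rng, CURRENCIES),
    });
  }
  return txns;
}
```

Seeding the generator keeps failures reproducible: a test that fails on generated row 4,812 will fail on the same row in the next run.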

T

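Provisioning, masking, refreshing, and tearing down the data needed by tests. Done well, it's invisible. Done badly, it's a recurring source of failures that have nothing to do with the code under test, such as tests that break after a scheduled environment refresh.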
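The provision and teardown steps can be sketched as a small lifecycle helper that guarantees cleanup even when a test throws. `DataSet` and `withTestData` are hypothetical names for illustration; the sketch is synchronous for brevity, and masking and refreshing are left out.

```typescript
interface DataSet<T> {
  provision(): T; // create the data a test needs
  teardown(data: T): void; // remove it again
}

/** Provision data, run the test body, and always tear down afterwards. */
function withTestData<T, R>(dataSet: DataSet<T>, body: (data: T) => R): R {
  const data = dataSet.provision();
  try {
    return body(data);
  } finally {
    // Runs whether the body returned or threw, so no data leaks into later tests.
    dataSet.teardown(data);
  }
}
```

Putting the teardown in `finally` is the key design choice: leftover data from a failed test is a common way for one failure to cascade into unrelated ones.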

A known, fixed state used as a baseline for tests — sample data, a seeded database, or a configured environment that ensures repeatability across runs.

Two ways to produce test data. A fixture is a fixed, predefined dataset loaded as-is (the same "User 1" every time) — predictable but rigid. A factory generates objects on demand with sensible defaults you override per test (`buildUser({ role: 'admin' })`) — flexible and DRY. Fixtures suit a stable shared baseline; factories suit tests that each need a slightly different variant.
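The contrast can be sketched as follows. The `User` shape and the default values are assumptions; `buildUser` follows the call style shown above.

```typescript
type Role = "member" | "admin";

interface User {
  id: number;
  name: string;
  role: Role;
  active: boolean;
}

// Fixture: one fixed record, identical in every test. Frozen so tests cannot mutate it.
const USER_1_FIXTURE: Readonly<User> = Object.freeze<User>({
  id: 1,
  name: "User 1",
  role: "member",
  active: true,
});

// Factory: builds a fresh object on demand, with defaults that a test can override.
let nextUserId = 1;
function buildUser(overrides: Partial<User> = {}): User {
  return {
    id: nextUserId++,
    name: "Test User",
    role: "member",
    active: true,
    ...overrides, // anything passed in replaces the default
  };
}

// Each test states only the detail it cares about.
const admin = buildUser({ role: "admin" });
const inactive = buildUser({ active: false });
```

A test reading `buildUser({ role: 'admin' })` shows at a glance which attribute matters, while the fixture gives every test the same shared baseline.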