PII-safe synthetic data

10 min read · Reviewed May 2026 · compliance score: partial (compliance landscape shifts; reviewed quarterly)

The comforting story — "we use synthetic data so we are GDPR-safe" — is often wrong. Privacy preservation is a spectrum from pseudonymisation, the weakest technique, to full synthesis with differential privacy, the strongest, and the UK Data (Use and Access) Act 2025 and the ICO's evolving guidance on automated decision-making change what "safe enough" means in 2026. The practitioner question is not which label to attach to your test data; it is which re-identification risk is acceptable for your use case, and whether you can demonstrate that assessment to a regulator.

DIFFICULTY: intermediate

YOU'LL LEARN: The four privacy-preservation techniques for test data, how they compare on guarantees, and what the current UK/EU regulatory frame actually requires.

The privacy spectrum

Four techniques, four different re-identification risk profiles — the choice is a risk assessment, not a label.

Production data flows into a test environment through one of four privacy-preservation techniques: pseudonymisation, tokenisation, full synthesis, or differential privacy. Each represents a different point on the re-identification risk spectrum. The diagram below shows the transformation pipeline from production data to test environment across the four parallel paths.

What the diagram does not show is that the techniques vary not just in risk level but in what they preserve. Pseudonymisation preserves the exact record structure. Full synthesis preserves statistical properties of the data. Differential privacy deliberately degrades individual-level fidelity to provide mathematical privacy guarantees. Choosing between them is a function of what your tests actually need and what your risk assessment requires.
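The trade-off that differential privacy makes can be seen in a few lines. The following is a minimal sketch of the Laplace mechanism applied to a counting query, assuming a sensitivity of 1 (adding or removing one person changes the count by at most 1). The function names `laplace_noise` and `dp_count` are illustrative, not taken from any library, and naive floating-point sampling like this is not suitable for production use; a vetted differential-privacy library should be used there.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # expovariate takes a rate, so rate = 1 / scale gives a mean of `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count for a counting query (sensitivity 1) at privacy budget epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Smaller epsilon means a larger noise scale: stronger privacy, lower accuracy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Illustrative usage with made-up ages.
rng = random.Random(0)
ages = [25, 41, 67, 38, 52]
noisy_strict = dp_count(ages, lambda a: a > 40, epsilon=0.1, rng=rng)
noisy_loose = dp_count(ages, lambda a: a > 40, epsilon=5.0, rng=rng)
```

With a small ε the answer can be far from the true count of 3; with a large ε it stays close, at the cost of a weaker guarantee. That is the fidelity-for-privacy exchange described above.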

[Figure: The four techniques on a single risk-protection axis. Caption: Four privacy-preservation techniques for test data.]

What each technique actually guarantees

Re-identification risk, referential integrity, and best-fit use case — compared across all four approaches.

The choice between techniques is a risk decision, not a tool decision. Pseudonymisation is the most common starting point for teams without formal data-privacy engineering experience — it is the easiest to implement and preserves the full record structure that functional tests require. It is also the weakest privacy technique, and should never be described as "anonymised data" to a regulator or compliance team.
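To make the weakness concrete, here is a minimal sketch of consistent pseudonymisation using a keyed hash (HMAC-SHA256) from the Python standard library. The helper names, the field names, and the key are hypothetical and chosen for illustration only; real pipelines would manage the key in a secrets store.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes, prefix: str = "user") -> str:
    """Map an identifier to a stable fake value using HMAC-SHA256."""
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def pseudonymise_record(record: dict, direct_identifiers: set, secret_key: bytes) -> dict:
    """Replace only the direct identifiers; every other field passes through unchanged."""
    return {
        name: pseudonymise(str(val), secret_key, prefix=name) if name in direct_identifiers else val
        for name, val in record.items()
    }


# Illustrative usage with a made-up record and a test-only key.
key = b"hypothetical-test-only-key"
row = {"name": "Ada Example", "email": "ada@example.com",
       "postcode": "SW1A 1AA", "dob": "1990-01-01"}
safe = pseudonymise_record(row, {"name", "email"}, key)
```

The same input always yields the same pseudonym, so joins across tables keep working. Note that `postcode` and `dob` survive untouched in `safe`, which is exactly the quasi-identifier exposure that keeps this technique from counting as anonymisation.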

Technique	How it works	Re-identification risk	Referential integrity	Best fit
Pseudonymisation	Replace direct identifiers (name, email) with consistent fake values; record structure unchanged	HIGH: linkage attacks via quasi-identifiers (postcode + DOB) are well-documented	Full	Dev environments; never claim "anonymised" to a regulator
Tokenisation	Replace identifiers with cryptographic tokens reversible with a key; the mapping table is held securely	MEDIUM: the mapping table is the attack surface	Full	When end-to-end tests need to round-trip back to production identifiers
Full synthesis (SDV, NVIDIA SDG, Tonic)	Generate new records that statistically resemble production but contain no real values	LOW: if the model is well-configured and evaluated against re-identification benchmarks	Model-dependent; multi-table synthesis requires explicit relational configuration	Regulated industries with mature data teams; Tonic.ai, DataCebo SDV Enterprise, NVIDIA SDG
Differential privacy	Add calibrated noise so individual records cannot be reverse-engineered even with auxiliary data	MATHEMATICALLY BOUNDED: formal ε-differential privacy guarantees	Degrades as the privacy budget ε shrinks; high-ε settings restore utility but weaken guarantees	Highest-sensitivity workloads: healthcare, finance KYC, government records

Privacy-preservation techniques compared, May 2026
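The tokenisation row can be illustrated with a small vault. This is a sketch only: it uses random tokens and a lookup table rather than key-based encryption, the class name `TokenVault` is hypothetical, and the in-memory dictionaries stand in for whatever secured store holds the mapping in practice.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secured token mapping table (illustrative only)."""

    def __init__(self) -> None:
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenise(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        # Collisions are extremely unlikely at 64 bits, but retry to be safe.
        while token in self._token_to_value:
            token = "tok_" + secrets.token_hex(8)
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenise(self, token: str) -> str:
        """Round-trip back to the original value; raises KeyError for unknown tokens."""
        return self._token_to_value[token]


# Illustrative usage with a made-up identifier.
vault = TokenVault()
token = vault.tokenise("ada@example.com")
original = vault.detokenise(token)
```

Anyone who can read the vault can reverse every token, which is why the table lists the mapping as the attack surface.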

The current UK/EU regulatory frame

"Synthetic = anonymous" is increasingly untenable — regulators now expect documented re-identification risk assessments.

The ICO's current strategic focus, as of its March 2026 AI and biometrics strategy update (ico.org.uk), is the Data (Use and Access) Act 2025 and the forthcoming automated decision-making (ADM) code of practice. The earlier ICO synthetic-data-specific guidance remains relevant, but the broader regulatory pressure has shifted towards requiring documented assessments of re-identification risk rather than accepting "synthetic data" as a blanket exemption from GDPR obligations.

At the European level, EDPB Opinion 28/2024 set the frame for anonymisation thresholds — specifically, that anonymisation is a property of the technique applied plus the context of deployment, not a property of the data itself. A dataset that is effectively anonymous in one context may not be anonymous in another, and the controller bears the burden of demonstrating that assessment.

NIST SP 800-188 (US frame, widely cited in international practice) provides the most operationally useful guidance on de-identification techniques: it categorises methods, describes attack models, and provides guidance on evaluating residual re-identification risk. Teams working in regulated industries should read it alongside ICO guidance, regardless of jurisdiction.

The practical takeaway for QA practitioners: if you are using synthetic test data in a regulated industry, document the technique you used, the re-identification risk assessment you performed, and the tools involved. "We used synthetic data" is not a sufficient answer to an ICO enquiry; "we used full synthesis with Tonic.ai, evaluated re-identification risk using [method], and assessed residual risk as low for the following reasons" is.
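One way to keep that documentation consistent is to store it as a structured record next to the dataset. The sketch below is an assumption about what such a record might contain; the class and field names are hypothetical and do not come from any regulator's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date

ALLOWED_RISK_LEVELS = {"low", "medium", "high"}


@dataclass
class ReidentificationAssessment:
    """Hypothetical record of how a test dataset was produced and assessed."""
    dataset: str
    technique: str
    tools: list
    evaluation_method: str
    residual_risk: str
    rationale: str
    assessed_on: str = field(default_factory=lambda: date.today().isoformat())

    def __post_init__(self) -> None:
        # Reject free-text risk levels so records stay comparable over time.
        if self.residual_risk not in ALLOWED_RISK_LEVELS:
            raise ValueError(f"residual_risk must be one of {sorted(ALLOWED_RISK_LEVELS)}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative usage; the values are placeholders, not a real assessment.
record = ReidentificationAssessment(
    dataset="orders_test_snapshot",
    technique="full synthesis",
    tools=["Tonic.ai"],
    evaluation_method="<describe the method used>",
    residual_risk="low",
    rationale="<state the reasons>",
)
print(record.to_json())
```

A record like this can be versioned alongside the pipeline configuration, so the answer to an enquiry is retrieved rather than reconstructed.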

Re-identification — the failure mode

Synthetic data that looks too realistic may not have been stress-tested for linkage attacks.

The most persistent misconception in synthetic data practice is that realistic-looking data is safe data. A 2019 study by researchers at UCLouvain and Imperial College London (Rocher et al., Nature Communications) estimated that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes, even if that dataset had been "de-identified". Synthetic data that preserves real distributions (age distributions, postcode clusters, income correlations) is vulnerable to similar linkage attacks, because the attacker needs only one auxiliary dataset with overlapping attributes.
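A quick first signal of linkage exposure is how many records are unique on their quasi-identifiers. The sketch below measures simple sample uniqueness; it is not the estimator used in the study cited above, and the function and field names are assumptions for illustration.

```python
from collections import Counter


def uniqueness_rate(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Illustrative usage with made-up rows: two share a combination, two are unique.
rows = [
    {"postcode": "AB1", "birth_year": 1980},
    {"postcode": "AB1", "birth_year": 1980},
    {"postcode": "AB1", "birth_year": 1991},
    {"postcode": "CD2", "birth_year": 1975},
]
rate = uniqueness_rate(rows, ["postcode", "birth_year"])  # 0.5
```

Adding more attributes to `quasi_identifiers` can only keep this rate the same or push it up, which is the mechanism behind re-identification from a modest number of demographic fields.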

The operational check is not whether your synthetic data "looks anonymous". It is whether it survives a structured re-identification attempt using publicly available auxiliary data. For most QA teams this is a compliance team question, not an engineering question — but engineers building the synthetic data pipeline need to understand what they are handing off and what guarantees they can and cannot make about it.
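A structured attempt of this kind can be sketched as a join between the released dataset and an auxiliary dataset on their shared quasi-identifiers. The function name, the field names, and the data below are hypothetical. For fully synthetic rows a match is a signal to investigate rather than proof of re-identification, because the row may not correspond to any real person.

```python
from collections import defaultdict


def linkage_matches(released, auxiliary, quasi_identifiers, identity_field="name"):
    """Return {released row index: identity} where exactly one auxiliary person matches."""
    index = defaultdict(list)
    for person in auxiliary:
        key = tuple(person[q] for q in quasi_identifiers)
        index[key].append(person[identity_field])

    matches = {}
    for i, row in enumerate(released):
        candidates = index.get(tuple(row[q] for q in quasi_identifiers), [])
        # Only an unambiguous single candidate counts as a linkage hit.
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches


# Illustrative usage with made-up data.
released = [
    {"postcode": "AB1", "birth_year": 1980, "diagnosis": "x"},
    {"postcode": "AB1", "birth_year": 1991, "diagnosis": "y"},
    {"postcode": "CD2", "birth_year": 1975, "diagnosis": "z"},
]
auxiliary = [
    {"name": "Person A", "postcode": "AB1", "birth_year": 1980},
    {"name": "Person B", "postcode": "AB1", "birth_year": 1991},
    {"name": "Person C", "postcode": "AB1", "birth_year": 1991},
]
hits = linkage_matches(released, auxiliary, ["postcode", "birth_year"])
```

Here only the first released row links to a single person; the second has two candidates and the third has none. The share of rows in `hits` is a number an engineer can hand to the compliance team along with the dataset.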

// WARNING

Synthetic ≠ anonymous. A 2019 Imperial College study showed 99.98% of Americans could be re-identified from 15 demographic attributes in a 'de-identified' dataset. Synthetic data that preserves real distributions is vulnerable to similar linkage attacks. If your synthetic data 'looks too realistic', it probably hasn't been stress-tested for re-identification risk.