PII-safe synthetic data
The comforting story, "we use synthetic data so we are GDPR-safe", is often wrong. Privacy preservation is a spectrum that runs from pseudonymisation, the weakest technique, to full synthesis with differential privacy, the strongest. The UK Data (Use and Access) Act 2025 and the ICO's evolving guidance on automated decision-making also change what "safe enough" means in 2026. The practitioner question is not which label to attach to your test data; it is which re-identification risk is acceptable for your use case, and whether you can demonstrate that assessment to a regulator.
The privacy spectrum
Four techniques, four different re-identification risk profiles — the choice is a risk assessment, not a label.
Production data flows into a test environment through one of four privacy-preservation techniques: pseudonymisation, tokenisation, full synthesis, or differential privacy. Each represents a different point on the re-identification risk spectrum. The diagram below shows the transformation pipeline from production data to test environment across the four parallel paths.
What the diagram does not show is that the techniques vary not just in risk level but in what they preserve. Pseudonymisation preserves the exact record structure. Full synthesis preserves statistical properties of the data. Differential privacy deliberately degrades individual-level fidelity to provide mathematical privacy guarantees. Choosing between them is a function of what your tests actually need and what your risk assessment requires.
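The way differential privacy trades individual-level fidelity for a guarantee can be illustrated with the Laplace mechanism, one standard way to achieve ε-differential privacy for numeric queries. The sketch below is a minimal illustration under stated assumptions: `laplace_noise`, `dp_count`, and the example records are hypothetical names chosen here, and a real deployment would also need privacy-budget accounting and a vetted library with a secure noise source instead of Python's `random` module.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    is Laplace-distributed with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(records: list[dict], predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the true count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the illustration repeatable
    people = [{"age": age} for age in range(18, 90)]  # hypothetical records
    for eps in (0.1, 1.0, 10.0):
        noisy = dp_count(people, lambda r: r["age"] >= 65, eps, rng)
        print(f"epsilon={eps}: noisy count of over-65s = {noisy:.1f}")
```

A small ε adds noise with a large scale (strong guarantee, poor utility); a large ε does the opposite, which is the utility trade-off that any test suite consuming such data has to tolerate.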
What each technique actually guarantees
Re-identification risk, referential integrity, and best-fit use case — compared across all four approaches.
The choice between techniques is a risk decision, not a tool decision. Pseudonymisation is the most common starting point for teams without formal data-privacy engineering experience — it is the easiest to implement and preserves the full record structure that functional tests require. It is also the weakest privacy technique, and should never be described as "anonymised data" to a regulator or compliance team.
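As a rough illustration of why pseudonymisation is both easy and weak, the sketch below replaces direct identifiers with a keyed hash (HMAC-SHA256) so that the same input always maps to the same pseudonym. This is one possible approach, not the only one: the key, the field names, and the helper functions are hypothetical, and the key would normally live in a secrets manager.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; never hard-code a real key.
SECRET_KEY = b"example-key-do-not-use"
# Assumed set of direct identifiers; adjust to the actual schema.
DIRECT_IDENTIFIERS = ("name", "email")


def pseudonymise_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable pseudonym: same input and key, same output."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"


def pseudonymise_record(record: dict) -> dict:
    """Replace direct identifiers; leave every other field untouched."""
    return {
        field: pseudonymise_value(value) if field in DIRECT_IDENTIFIERS else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    customer = {"name": "Ada Example", "email": "ada@example.com",
                "postcode": "AB1 2CD", "dob": "1990-01-01"}
    order = {"email": "ada@example.com", "order_id": 7}
    safe_customer = pseudonymise_record(customer)
    safe_order = pseudonymise_record(order)
    # The join key survives, so functional tests that join tables still work.
    print(safe_customer["email"] == safe_order["email"])
    # Quasi-identifiers survive too, which is what makes linkage attacks possible.
    print(safe_customer["postcode"], safe_customer["dob"])
```

The output keeps the full record structure and cross-table consistency, but postcode and date of birth pass through unchanged, so the result is pseudonymised data and not anonymised data.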
| Technique | How it works | Re-identification risk | Referential integrity | Best fit |
|---|---|---|---|---|
| Pseudonymisation | Replace direct identifiers (name, email) with consistent fake values; record structure unchanged | HIGH: linkage attacks via quasi-identifiers (postcode + DOB) are well documented | Full | Dev environments; never claim "anonymised" to a regulator |
| Tokenisation | Replace identifiers with cryptographic tokens reversible with a key; the mapping table is held securely | MEDIUM: the mapping table is the attack surface | Full | When end-to-end tests need to round-trip back to production identifiers |
| Full synthesis (SDV, NVIDIA SDG, Tonic) | Generate new records that statistically resemble production but contain no real values | LOW: if the model is well-configured and evaluated against re-identification benchmarks | Model-dependent; multi-table synthesis requires explicit relational configuration | Regulated industries with mature data teams; Tonic.ai, DataCebo SDV Enterprise, NVIDIA SDG |
| Differential privacy | Add calibrated noise so individual records cannot be reverse-engineered even with auxiliary data | MATHEMATICALLY BOUNDED: formal ε-differential privacy guarantees | Degrades with privacy budget; high-ε settings restore utility but weaken guarantees | Highest-sensitivity workloads: healthcare, finance KYC, government records |
Privacy-preservation techniques compared, May 2026
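The tokenisation row's point that the mapping table is the attack surface can be seen in a small sketch. The example uses a vault-style variant (random tokens plus a lookup table) instead of key-based reversible encryption, purely to keep it short; `TokenVault` and its methods are hypothetical names, and an in-memory dictionary stands in for what would be an access-controlled store.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secured token mapping table (illustration only)."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenise(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value not in self._token_by_value:
            token = f"tok_{secrets.token_hex(8)}"
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenise(self, token: str) -> str:
        """Reverse a token; anyone with access to the vault can do this."""
        return self._value_by_token[token]


if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenise("ada@example.com")
    # The test environment only ever sees the token...
    print(token)
    # ...but the round trip back to the production identifier is one lookup away.
    print(vault.detokenise(token))
```

Because the tokens are random, they reveal nothing on their own; all of the re-identification risk is concentrated in whoever can read the vault.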
The current UK/EU regulatory frame
"Synthetic = anonymous" is increasingly untenable — regulators now expect documented re-identification risk assessments.
The ICO's current strategic focus, as of its March 2026 AI and biometrics strategy update (ico.org.uk), is the Data (Use and Access) Act 2025 and the forthcoming automated decision-making (ADM) code of practice. The earlier ICO synthetic-data-specific guidance remains relevant, but the broader regulatory pressure has shifted towards requiring documented assessments of re-identification risk rather than accepting "synthetic data" as a blanket exemption from GDPR obligations.
At the European level, EDPB Opinion 28/2024 set the frame for anonymisation thresholds — specifically, that anonymisation is a property of the technique applied plus the context of deployment, not a property of the data itself. A dataset that is effectively anonymous in one context may not be anonymous in another, and the controller bears the burden of demonstrating that assessment.
NIST SP 800-188 (US frame, widely cited in international practice) provides the most operationally useful guidance on de-identification techniques: it categorises methods, describes attack models, and provides guidance on evaluating residual re-identification risk. Teams working in regulated industries should read it alongside ICO guidance, regardless of jurisdiction.
The practical takeaway for QA practitioners: if you are using synthetic test data in a regulated industry, document the technique you used, the re-identification risk assessment you performed, and the tools involved. "We used synthetic data" is not a sufficient answer to an ICO enquiry; "we used full synthesis with Tonic.ai, evaluated re-identification risk using [method], and assessed residual risk as low for the following reasons" is.
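One way to make that documentation habit concrete is to store a structured record alongside each generated dataset. The sketch below is an assumption about what such a record might contain; the class name and fields are hypothetical and do not reflect any schema prescribed by the ICO or another regulator.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class ReidentificationAssessment:
    """One documented assessment for one test dataset (illustrative schema)."""

    dataset: str
    technique: str
    tools: list[str]
    evaluation_method: str
    residual_risk: str
    rationale: str
    assessed_on: str = field(default_factory=lambda: date.today().isoformat())

    def validate(self) -> None:
        """Reject records that leave any field empty."""
        missing = [name for name, value in asdict(self).items() if not value]
        if missing:
            raise ValueError(f"assessment incomplete, missing: {', '.join(missing)}")

    def to_json(self) -> str:
        """Serialise a complete record so it can be versioned with the dataset."""
        self.validate()
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    record = ReidentificationAssessment(
        dataset="orders_test_v3",            # hypothetical dataset name
        technique="full synthesis",
        tools=["example-synthesis-tool"],    # placeholder, not a real product
        evaluation_method="linkage test against public auxiliary data",
        residual_risk="low",
        rationale="no unambiguous quasi-identifier matches found",
    )
    print(record.to_json())
```

Failing validation when the evaluation method or rationale is empty keeps "we used synthetic data" from being recorded as a complete answer.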
Re-identification — the failure mode
Synthetic data that looks too realistic may not have been stress-tested for linkage attacks.
The most persistent misconception in synthetic data practice is that realistic-looking data is safe data. A 2019 Imperial College study estimated that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes, even when that dataset had been "de-identified". Synthetic data that preserves real distributions (age distributions, postcode clusters, income correlations) is vulnerable to similar linkage attacks, because the attacker needs only one auxiliary dataset with overlapping attributes.
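A simple first signal of linkage exposure is how many records are unique on a given combination of quasi-identifiers. The sketch below is a minimal uniqueness check in the spirit of k-anonymity, not a substitute for a formal assessment; the function names and sample records are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(records: list[dict], quasi_identifiers: list[str]) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def uniqueness_rate(records: list[dict], quasi_identifiers: list[str]) -> float:
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    unique = sum(1 for count in sizes.values() if count == 1)
    return unique / len(records)


if __name__ == "__main__":
    sample = [  # hypothetical records
        {"postcode": "AB1", "dob": "1990-01-01", "sex": "F"},
        {"postcode": "AB1", "dob": "1985-06-30", "sex": "F"},
        {"postcode": "CD2", "dob": "1990-01-01", "sex": "M"},
        {"postcode": "EF3", "dob": "1972-11-12", "sex": "M"},
    ]
    print(uniqueness_rate(sample, ["postcode"]))         # 0.5
    print(uniqueness_rate(sample, ["postcode", "dob"]))  # 1.0
```

Adding a single attribute moves this toy sample from half unique to fully unique, which is the same effect the study describes at population scale.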
The operational check is not whether your synthetic data "looks anonymous". It is whether it survives a structured re-identification attempt using publicly available auxiliary data. For most QA teams this is a compliance team question, not an engineering question — but engineers building the synthetic data pipeline need to understand what they are handing off and what guarantees they can and cannot make about it.
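To give engineers a sense of what a structured re-identification attempt involves, the sketch below simulates the simplest form of linkage: joining a released dataset to an auxiliary dataset on shared quasi-identifiers and counting unambiguous matches. It is an assumption-laden toy, with hypothetical function names and data; for synthetic data a match is a prompt to investigate whether a real record leaked through, not proof of re-identification.

```python
def linkage_matches(
    released: list[dict], auxiliary: list[dict], quasi_identifiers: list[str]
) -> list[tuple[dict, dict]]:
    """Pair each auxiliary record with a released record when the match is unambiguous.

    A pair is reported only if exactly one released record shares the
    auxiliary record's quasi-identifier values.
    """
    index: dict[tuple, list[dict]] = {}
    for row in released:
        index.setdefault(tuple(row[q] for q in quasi_identifiers), []).append(row)

    matches = []
    for person in auxiliary:
        candidates = index.get(tuple(person[q] for q in quasi_identifiers), [])
        if len(candidates) == 1:
            matches.append((person, candidates[0]))
    return matches


def linkage_rate(
    released: list[dict], auxiliary: list[dict], quasi_identifiers: list[str]
) -> float:
    """Share of auxiliary records that link to exactly one released record."""
    if not auxiliary:
        return 0.0
    return len(linkage_matches(released, auxiliary, quasi_identifiers)) / len(auxiliary)


if __name__ == "__main__":
    released = [  # hypothetical test data handed to a QA environment
        {"postcode": "AB1", "dob": "1990-01-01", "diagnosis": "X"},
        {"postcode": "CD2", "dob": "1985-06-30", "diagnosis": "Y"},
        {"postcode": "CD2", "dob": "1985-06-30", "diagnosis": "Z"},
    ]
    auxiliary = [  # hypothetical public data with names attached
        {"name": "Ada Example", "postcode": "AB1", "dob": "1990-01-01"},
        {"name": "Bob Example", "postcode": "CD2", "dob": "1985-06-30"},
    ]
    print(linkage_rate(released, auxiliary, ["postcode", "dob"]))  # 0.5
```

A result like this is the kind of evidence a pipeline owner can hand to the compliance team: which quasi-identifiers were tested, against which auxiliary source, and what fraction linked.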
// WARNING