Data Masking

Test Data | intermediate | aka Data Obfuscation | aka Anonymisation

// Definition

Replacing sensitive fields in a real dataset with realistic but fake values, so that a copy of production can be used for testing without exposing actual PII. Names become other names, card numbers become valid-format fakes, and emails are scrambled, while the data keeps its shape, relationships, and distribution. It is the middle path between "test on raw production" (often illegal) and "test on pure synthetic" (less realistic).
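
As a rough illustration of the definition above, the sketch below masks a name, an email, and a card number while keeping each value's format. It is a minimal example under stated assumptions, not a production masking tool: `MASKING_KEY`, `FAKE_NAMES`, and every function name are hypothetical, and it assumes Node.js with its built-in `node:crypto` module.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical key for illustration; a real one would live in a secret store.
const MASKING_KEY = "test-only-masking-key";

// Tiny illustrative pool; real tools draw from much larger dictionaries.
const FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Kim"];

// Keyed digest: the same input and key always produce the same bytes.
function digest(value: string): Buffer {
  return createHmac("sha256", MASKING_KEY).update(value).digest();
}

// Names become other names, picked deterministically from the pool.
function maskName(name: string): string {
  return FAKE_NAMES[digest(name).readUInt32BE(0) % FAKE_NAMES.length];
}

// Emails keep a valid shape but point at example.com, a domain reserved
// for documentation, so no masked address can reach a real inbox.
function maskEmail(email: string): string {
  const token = digest(email.toLowerCase()).toString("hex").slice(0, 12);
  return `user_${token}@example.com`;
}

// Luhn check digit for a string of payload digits (all digits except the last).
function luhnCheckDigit(payload: string): number {
  let sum = 0;
  for (let i = 0; i < payload.length; i++) {
    let d = Number(payload[payload.length - 1 - i]);
    // Every second digit, starting with the rightmost payload digit, is doubled.
    if (i % 2 === 0) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return (10 - (sum % 10)) % 10;
}

// Card numbers become 16-digit, Luhn-valid fakes derived from the original.
function maskCardNumber(card: string): string {
  const bytes = digest(card);
  let payload = "";
  for (let i = 0; i < 15; i++) payload += String(bytes[i] % 10);
  return payload + String(luhnCheckDigit(payload));
}
```

Because the mapping is deterministic, the same source value always masks to the same fake value, which is what lets relationships between tables survive. A randomly derived Luhn-valid number could in principle coincide with a real card, so a stricter variant might restrict output to a known test range.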

// Why it matters

Teams routinely copy production to a test environment to reproduce real bugs, and in doing so quietly carry real customer PII into a less secure place, which amounts to a breach and is often a legal violation. Masking keeps the realism of production data without the liability. QA cares because unmasked test environments are among the most common and most serious data-governance failures, and verifying that the masking worked is itself a test.

// How to test

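// Verify that no real PII survived masking in the test dataset.
// Assumes a 'db:query' task is registered in the Cypress config, and that a sample of
// production addresses is supplied via a hypothetical 'knownRealEmails' env value.
type UserRow = { email: string; card_last4: string; ssn: string }
const KNOWN_REAL_EMAILS: string[] = Cypress.env('knownRealEmails') ?? []

cy.task<UserRow[]>('db:query', 'SELECT email, card_last4, ssn FROM users LIMIT 500')
  .then((rows) => {
    expect(rows, 'masked rows').to.not.be.empty // an empty table would pass vacuously
    rows.forEach((r) => {
      expect(r.email).to.not.match(/@(gmail|yahoo|company)\.com$/i) // real domains scrubbed
      expect(r.ssn).to.match(/^000-/)        // masked sentinel: area number 000 is never issued
      expect(KNOWN_REAL_EMAILS).to.not.include(r.email) // no production address leaked
    })
  })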
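
The query above only inspects typed columns. PII can also hide in notes fields or JSON payloads, so a similar check can walk arbitrary values and flag anything that still looks like an email or SSN. The sketch below is one possible approach: `scanForPii`, `PII_PATTERNS`, and the allow rules (the `000-` sentinel and `example.*` domains) are assumptions for illustration, and the patterns are deliberately loose rather than exhaustive.

```typescript
// Illustrative patterns only; a real scanner would cover more PII types.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
};

interface Finding {
  path: string;  // where in the row the value was found
  kind: string;  // which pattern matched
  match: string; // the offending text
}

// Values that match a pattern but are recognisably masked are allowed.
function isMaskedValue(kind: string, match: string): boolean {
  if (kind === "email") return /@example\.(com|org|net)$/i.test(match);
  if (kind === "ssn") return match.indexOf("000-") === 0;
  return false;
}

// Recursively walk strings, arrays, and objects, collecting unmasked PII.
function scanForPii(value: unknown, path: string = "$"): Finding[] {
  const findings: Finding[] = [];
  if (typeof value === "string") {
    for (const kind of Object.keys(PII_PATTERNS)) {
      // With the global flag, match() returns every match or null.
      const matches = value.match(PII_PATTERNS[kind]) ?? [];
      for (const m of matches) {
        if (!isMaskedValue(kind, m)) findings.push({ path, kind, match: m });
      }
    }
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => {
      findings.push(...scanForPii(item, `${path}[${i}]`));
    });
  } else if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      findings.push(...scanForPii(obj[key], `${path}.${key}`));
    }
  }
  return findings;
}
```

In a test, each row returned by the query (with JSON columns parsed first) could be passed to `scanForPii`, with the assertion that the result is an empty array. The `path` field makes a failure message point at the exact column or nested key that leaked.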

// Common mistakes

  • Masking obvious fields (name) but missing PII in free-text, logs, or JSON blobs
  • Breaking referential integrity (a masked user id no longer matches their orders)
  • Reversible masking (an unkeyed, consistent hash that can be re-identified by hashing candidate values)
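
The last two mistakes can be illustrated together. The sketch below, which assumes Node.js and uses hypothetical names throughout (`MASKING_KEY`, `pseudonymizeId`, `maskDataset`, `orphanedOrders`, `reidentify`), masks user ids with a keyed HMAC so that the same id maps to the same token in every table, then shows why a plain unkeyed hash is reversible in practice. It is one possible approach, not the only one.

```typescript
import { createHash, createHmac } from "node:crypto";

interface User { id: string; email: string }
interface Order { orderId: number; userId: string }

// Hypothetical key; it must stay out of the test environment's reach.
const MASKING_KEY = "test-only-masking-key";

// Same input always yields the same token, so foreign keys still line up.
function pseudonymizeId(id: string): string {
  const hex = createHmac("sha256", MASKING_KEY).update(id).digest("hex");
  return "u_" + hex.slice(0, 16);
}

// Apply the SAME mapping to the primary key and to every column that references it.
function maskDataset(users: User[], orders: Order[]): { users: User[]; orders: Order[] } {
  return {
    users: users.map((u) => {
      const maskedId = pseudonymizeId(u.id);
      return { id: maskedId, email: `${maskedId}@example.com` };
    }),
    orders: orders.map((o) => ({ orderId: o.orderId, userId: pseudonymizeId(o.userId) })),
  };
}

// Referential-integrity check: orders whose user no longer exists after masking.
function orphanedOrders(users: User[], orders: Order[]): Order[] {
  const ids = new Set(users.map((u) => u.id));
  return orders.filter((o) => !ids.has(o.userId));
}

// Why an unkeyed hash is reversible: hash each candidate and compare.
function reidentify(maskedValue: string, candidates: string[]): string | undefined {
  return candidates.find(
    (c) => createHash("sha256").update(c).digest("hex") === maskedValue
  );
}
```

A masking test can assert that `orphanedOrders` returns an empty array for the masked dataset. The `reidentify` function shows the attack from the other side: anyone holding a list of plausible emails can recover values masked with plain SHA-256, whereas the keyed variant resists this as long as the key stays secret.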

// Related terms