Q21 of 38 · Performance

How would you generate realistic test data at scale for a marketplace search load test?

Performance · Senior · performance · test-data · search · zipfian · marketplace

Short answer

Sanitised production query logs are gold: anonymise PII, then replay the actual query distribution. For synthetic generation, model a Zipfian distribution for query terms (long tail), realistic price/category mixes, and variation in personalisation signals. Reach realistic volume by replaying at production-typical RPS.

Detail

Marketplace search is a particularly hard case because the query distribution itself drives cache behaviour, query plan selection, and result-set sizes. Use the wrong distribution and the test misleads.

Source 1 — anonymised production logs. Pull a sample of search logs (1-10M queries), strip PII (geo down to city, no IPs, hash user IDs), and replay. Pros: distributions are exactly correct (term frequency, filter combinations, pagination patterns). Cons: legal/privacy review, sometimes complex sanitisation.
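A minimal sketch of the sanitisation step, assuming each log record is a dict. The field names (`user_id`, `ip`, `lat`, `lon`, `city`, `query`, `filters`, `page`, `ts`) and the key value are hypothetical; adapt them to your log schema. It uses an allow-list, so fields not named explicitly (IP, precise coordinates) are dropped by default. The query text is passed through unchanged here, so any PII typed into the search box still needs a separate scrub.

```python
import hashlib
import hmac

# Hypothetical secret. Keep it out of the test-data repository and rotate it per export.
PSEUDONYM_KEY = b"rotate-me-per-export"


def pseudonymise_user_id(user_id, key=PSEUDONYM_KEY):
    """Keyed hash: the same user always maps to the same token,
    so per-user patterns survive, but the raw ID is not stored."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def sanitise(record):
    """Copy only the fields a replay needs (allow-list).
    IP address and latitude/longitude are never copied."""
    return {
        "user": pseudonymise_user_id(record["user_id"]),
        "query": record["query"],
        "filters": record.get("filters", {}),
        "page": record.get("page", 1),
        "city": record.get("city"),  # geo kept at city level only
        "ts": record["ts"],
    }


raw = {
    "user_id": "u-1842", "ip": "203.0.113.7",
    "lat": 51.5072, "lon": -0.1276, "city": "London",
    "query": "iphone 15 case", "filters": {"category": "phones"},
    "page": 1, "ts": 1700000000,
}
clean = sanitise(raw)
```

A keyed hash (HMAC) is used rather than a bare hash because unsalted hashes of short, guessable IDs can be reversed by brute force.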

Source 2 — synthetic with realistic distributions.

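  • Query terms: Zipfian distribution. Roughly 10% of distinct queries ("iphone", "shoes") account for about 90% of volume; the other 90% of distinct queries form the long tail (typos, niche products). Generators that pick terms uniformly from a vocabulary spread load evenly across all terms, which is wildly unrealistic. Use numpy.random.zipf (its samples are unbounded ranks, so cap or re-draw values beyond your vocabulary size) or a real query-frequency dump.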
  • Filter combinations: most users apply 0-2 filters; some power users apply 5+. Model the distribution.
  • Pagination: 80% don't paginate; 15% go to page 2; 5% go deeper. Test deep pagination exclusively only if that is your scenario; for general load, match real usage.
  • Geographic / personalisation: regional preference and user history change cache hit rate. Vary user IDs across a realistic set.
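The bullets above can be combined into one generator. The following is a sketch using only the Python standard library: it draws terms from a bounded Zipf-like distribution (weight 1 / rank^s over a fixed vocabulary) instead of numpy.random.zipf, so every sample maps to a real term. The exponent `s = 1.1`, the filter-count weights, and the vocabulary are placeholder assumptions; only the 80/15/5 pagination split comes from the text.

```python
import random
from itertools import accumulate


def zipf_cum_weights(n_terms, s=1.1):
    """Cumulative weights where the term at rank r has weight 1 / r**s."""
    return list(accumulate(1.0 / rank ** s for rank in range(1, n_terms + 1)))


# Assumed mix: most users apply 0-2 filters, a few apply 5.
FILTER_COUNTS, FILTER_WEIGHTS = [0, 1, 2, 3, 5], [45, 30, 15, 7, 3]
# From the text: 80% stay on page 1, 15% reach page 2, 5% go deeper.
PAGE_DEPTHS, PAGE_WEIGHTS = [1, 2, 3], [80, 15, 5]  # 3 stands for "page 3 or deeper"


def generate_queries(vocab, n, users, s=1.1, seed=42):
    """vocab must be ordered from most to least popular term."""
    rng = random.Random(seed)
    cum = zipf_cum_weights(len(vocab), s)
    terms = rng.choices(vocab, cum_weights=cum, k=n)
    return [
        {
            "term": term,
            "n_filters": rng.choices(FILTER_COUNTS, weights=FILTER_WEIGHTS)[0],
            "page": rng.choices(PAGE_DEPTHS, weights=PAGE_WEIGHTS)[0],
            "user": rng.choice(users),  # vary users so personalisation caches see a mix
        }
        for term in terms
    ]


vocab = [f"term{rank}" for rank in range(1, 10_001)]  # placeholder vocabulary
users = [f"user{i}" for i in range(1_000)]            # placeholder user pool
queries = generate_queries(vocab, 50_000, users)

# Share of traffic that lands on the most popular 10% of terms.
head = set(vocab[:1_000])
head_share = sum(q["term"] in head for q in queries) / len(queries)
page1_share = sum(q["page"] == 1 for q in queries) / len(queries)
```

Tune `s` until `head_share` matches the head/tail split you observe in production; a larger `s` concentrates more traffic on the head.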

Source 3 — a small handcrafted set for known-edge-case correctness. Empty searches, single-character terms, 200-char queries, queries with special characters, queries that match millions of items. Mix into the load test at low frequency; they exercise edge code paths.
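One way to mix the handcrafted set into the main stream at low frequency is sketched below. The 1% rate and the specific edge-case strings are illustrative assumptions; in particular, "the" is only a stand-in for whatever term matches millions of items in your catalogue.

```python
import random

EDGE_CASES = [
    "",                              # empty search
    "a",                             # single-character term
    "x" * 200,                       # 200-character query
    '50% off & "free" <shipping>',   # special characters
    "the",                           # stand-in for a term that matches millions of items
]


def mix_edge_cases(queries, edge_rate=0.01, seed=7):
    """Yield the main query stream, swapping in a random edge case
    for roughly edge_rate of the entries."""
    rng = random.Random(seed)
    for query in queries:
        if rng.random() < edge_rate:
            yield rng.choice(EDGE_CASES)
        else:
            yield query


mixed = list(mix_edge_cases(["shoes"] * 10_000))
edge_count = sum(q != "shoes" for q in mixed)
```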

Volume:

  • Cardinality of search terms should approach production unique count — million-plus terms isn't unusual. A test with 100 terms hits caches every time and hides real cache-miss latency.
  • Repeat rate matters: production might see 30% of queries repeated within 15 minutes. If the test data repeats less often than that, its cache hit rate will be lower than production's; if it repeats more often, higher.
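Both volume properties can be measured on a candidate dataset before the test runs. The sketch below assumes events are `(timestamp_seconds, query)` pairs sorted by time, and it defines "repeated" as "the exact same query text was last seen no more than 15 minutes earlier". That definition is an assumption; match it to however your production metric is computed.

```python
def repeat_rate(events, window_s=900):
    """Fraction of queries whose exact text was last seen
    within the previous window_s seconds (900 s = 15 minutes)."""
    last_seen = {}
    repeats = total = 0
    for ts, query in events:
        prev = last_seen.get(query)
        if prev is not None and ts - prev <= window_s:
            repeats += 1
        last_seen[query] = ts
        total += 1
    return repeats / total if total else 0.0


def unique_terms(events):
    """Cardinality of the query set."""
    return len({query for _, query in events})


events = [
    (0, "iphone"),
    (60, "shoes"),
    (120, "iphone"),   # repeat: seen 120 s earlier
    (1200, "shoes"),   # not a repeat: last seen 1140 s earlier
    (1300, "lamp"),
]
rate = repeat_rate(events)         # 1 repeat out of 5 queries
cardinality = unique_terms(events)
```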

Tooling: load the data via SharedArray (k6) or CSV Data Set Config (JMeter). For very large datasets, stream from a file rather than loading everything into memory.
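If the load driver is a custom script rather than k6 or JMeter, streaming can be as simple as a generator that reads one row at a time. This is a sketch assuming a CSV file with a header row; the function name and the `loop` parameter are hypothetical.

```python
import csv


def stream_queries(path, loop=True):
    """Yield one CSV row (as a dict) at a time without loading the file.
    With loop=True the file is re-read from the start when exhausted,
    so the stream can outlast the dataset."""
    while True:
        yielded_any = False
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yielded_any = True
                yield row
        if not loop or not yielded_any:  # stop on single pass, or on an empty file
            return


# Usage, assuming a file named queries.csv with a "query" column:
#   for row in stream_queries("queries.csv"):
#       send_search(row["query"])   # send_search is a placeholder for your client call
```

Looping over the file raises the repeat rate once the stream wraps around, so size the file to cover the planned test duration where possible.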

Validation: before you trust the test, compare its cache hit rate to production's. If prod is 80% and the test is 20%, your distribution is wrong — fix the data, not the system.
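A rough offline check is possible before touching the real system: replay the generated query stream through a simulated LRU cache and see whether the hit rate is in the same range as production's. The sketch below assumes an LRU cache keyed on the query term alone, which is a simplification of any real search cache; the capacity, vocabulary size, and Zipf exponent are placeholder values.

```python
import random
from collections import OrderedDict
from itertools import accumulate


def lru_hit_rate(stream, capacity):
    """Replay keys through a fixed-size LRU cache and return the hit fraction."""
    cache = OrderedDict()
    hits = total = 0
    for key in stream:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


rng = random.Random(1)
n_terms, n_queries, capacity = 10_000, 50_000, 1_000
ranks = range(n_terms)  # rank 0 is the most popular term

# Zipf-like stream: weight 1 / rank**1.1 (ranks shifted to start at 1).
zipf_cum = list(accumulate(1.0 / (r + 1) ** 1.1 for r in ranks))
zipf_stream = rng.choices(ranks, cum_weights=zipf_cum, k=n_queries)

# Uniform stream over the same vocabulary.
uniform_stream = rng.choices(ranks, k=n_queries)

zipf_hits = lru_hit_rate(zipf_stream, capacity)
uniform_hits = lru_hit_rate(uniform_stream, capacity)
```

With a cache that holds 10% of the vocabulary, the uniform stream cannot hit much more than about 10% of the time, while the skewed stream hits far more often. The same gap appears between a uniform-random test and production.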

// WHAT INTERVIEWERS LOOK FOR

Awareness that distribution matters as much as volume, knowing Zipfian for query terms, validating against production cache hit rate. Bonus for citing privacy/sanitisation considerations.

// COMMON PITFALL

Generating uniform-random queries and concluding 'the search system is slow' — most production queries hit cache; uniform-random tests measure cache-miss path almost exclusively.