How would you generate realistic test data at scale for a marketplace search load test?

Question

Accepted Answer

Sanitised production query logs are the best source: anonymise PII, then replay the actual query distribution. For synthetic generation, model a Zipfian distribution for query terms (with a long tail), realistic price/category mixes, and variation in personalisation signals. Generate volume by replaying at production-typical RPS.

Marketplace search is a particularly hard case because the query distribution itself drives cache behaviour, query plan selection, and result-set sizes. Use the wrong distribution and the test results mislead.

Source 1: anonymised production logs. Pull a sample of search logs (1-10M queries), strip PII (reduce geo to city level, drop IPs, hash user IDs), and replay. Pros: distributions are exactly correct (term frequency, filter combinations, pagination patterns). Cons: legal/privacy review, and sanitisation is sometimes complex.

Source 2: synthetic with realistic distributions. Query terms: Zipfian distribution. Roughly 10% of distinct queries account for about 90% of volume ("iphone", "shoes"); the remaining 90% are long-tail (typo
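The two sources described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the record fields (`user_id`, `query`, `city`, `ip`), the helper names, and the Zipf exponent of 1.0 are all assumptions made for the example, and real search logs may need further scrubbing (for instance, PII typed into the query text itself).

```python
import hashlib
import hmac
import random
from itertools import accumulate


def sanitise_record(record, secret_key):
    """Return a copy of a log record that is safer to replay.

    Assumed (hypothetical) input fields: user_id, query, city, ip.
    The user ID is replaced by a keyed hash, the IP is dropped,
    and geo is kept at city level only.
    """
    user_hash = hmac.new(
        secret_key, record["user_id"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return {
        "user_id": user_hash,
        "query": record["query"],
        "city": record.get("city"),
    }


def zipf_weights(n_terms, exponent=1.0):
    """Weight for rank r is 1 / r**exponent (rank 1 is the most popular)."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_terms + 1)]


def sample_queries(terms, n_queries, exponent=1.0, seed=None):
    """Draw n_queries terms, assuming `terms` is ordered by popularity."""
    rng = random.Random(seed)
    cumulative = list(accumulate(zipf_weights(len(terms), exponent)))
    return rng.choices(terms, cum_weights=cumulative, k=n_queries)


if __name__ == "__main__":
    raw = {"user_id": "u-123", "query": "iphone", "city": "Leeds", "ip": "203.0.113.7"}
    print(sanitise_record(raw, secret_key=b"rotate-me"))

    vocabulary = ["iphone", "shoes"] + [f"long-tail-{i}" for i in range(998)]
    print(sample_queries(vocabulary, n_queries=10, seed=42))
```

A keyed hash (HMAC) is used rather than a bare hash so that user IDs cannot be recovered by hashing a list of known IDs. The exponent is best fitted to the observed term frequencies rather than left at the assumed default.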
