Q6 of 26 · SQL

Write a query to find duplicate records in a users table based on email address.

SQLJuniorsqlduplicatesgroup-bydata-integrity

Short answer

Short answer: Use GROUP BY email HAVING COUNT(*) > 1 to find duplicated emails, then join back to get the full rows.

Detail

Two-step approach:

Step 1 — find the duplicated values:

SELECT email, COUNT(*) AS cnt
FROM   users
GROUP BY email
HAVING COUNT(*) > 1;

Step 2 — retrieve all affected rows (so you can see IDs and decide which to keep):

SELECT u.*
FROM   users u
JOIN (
  SELECT email
  FROM   users
  GROUP BY email
  HAVING COUNT(*) > 1
) dups ON dups.email = u.email
ORDER BY u.email, u.created_at;

For a more complete duplicate check (duplicates across multiple columns — e.g., same first name + last name + DOB):

SELECT first_name, last_name, date_of_birth, COUNT(*)
FROM   customers
GROUP BY first_name, last_name, date_of_birth
HAVING COUNT(*) > 1;

When to use this in QA: after a data migration, after a bulk import, or when the app is missing a UNIQUE constraint you expected to be there.

// EXAMPLE

-- All user rows where the email appears more than once
SELECT u.id, u.email, u.created_at
FROM   users u
JOIN (
  SELECT email
  FROM   users
  GROUP BY email
  HAVING COUNT(*) > 1
) dups ON dups.email = u.email
ORDER BY u.email, u.created_at;

// MODEL ANSWER

I approach this in two steps. The first step is identifying which values appear more than once. I use GROUP BY on the email column to collapse rows with the same email into groups, COUNT star to count how many rows are in each group, and HAVING COUNT star greater than one to filter to only the duplicated values. That gives me a list of emails that appear more than once. The second step is pulling back the full rows for those duplicated emails so I can see the IDs and timestamps and decide which record to keep. I do that by joining the users table back to that subquery on the email column and ordering by email then created_at. In a QA context I would typically run this check after a data migration, after a bulk import, or when I suspect a unique constraint that should be in place is not actually there. If the duplication spans multiple columns — say the same first name, last name, and date of birth — I would add all three to the GROUP BY and HAVING clauses. The pattern is identical; you are just widening the definition of what counts as a duplicate.

// WHAT INTERVIEWERS LOOK FOR

Clean GROUP BY + HAVING pattern. Bonus: joining back to get full rows, and mentioning this is a post-migration or data-integrity check.