Finding Duplicate and Orphaned Records

The previous lessons verified individual rows after specific actions. This one zooms out: a handful of queries you can run across the whole database to find data-integrity bugs that may never surface in a single test. Duplicates that shouldn't exist (a unique constraint isn't doing its job). Orphans pointing to deleted parents (a cascade rule is missing). Totals that don't match their line items (the application's calculation is wrong somewhere). These five queries together form what teams call a data health sweep — and they often find more bugs than the entire UI test suite.

The five integrity checks every QA team should know

Data integrity sweep

– Same email registered twice (UNIQUE broken?)
– Same order submitted twice within a minute
– Same payment recorded twice

– order_items pointing at a deleted order
– orders pointing at a deleted user
– products pointing at a deleted category

– orders.total != SUM(order_items)
– users.order_count != COUNT(orders)
– Inventory != sum of stock movements

users with NULL email –
products with NULL price –
orders with NULL user_id –

user_id values that don't exist in users –
product_id values pointing to deleted products –

Each branch is a class of bug. Each is detectable in a single SQL query. Run these queries on your test database before every release and you'll catch issues that the application's own logs won't surface.

Finding duplicates

A column declared UNIQUE should not have duplicates. But constraints get dropped, get added late after the data is already messy, or get bypassed by data-import scripts. The detector:

-- Duplicate emails — exposes broken UNIQUE constraints or imports gone wrong
SELECT email, COUNT(*) AS dupes
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

In a healthy database, this returns zero rows. If it returns anything, you have a real, reportable bug.

A more subtle case is near-duplicates — same logical record, different ids — caused by retry-on-failure logic or a double-clicked submit button:

-- Same user, same total, two orders within one minute — likely a double-submit
SELECT user_id, total, COUNT(*) AS dupes,
       MIN(created_at) AS first, MAX(created_at) AS last
FROM orders
WHERE created_at >= DATE('now', '-1 day')
GROUP BY user_id, total
HAVING COUNT(*) > 1
   AND (julianday(MAX(created_at)) - julianday(MIN(created_at))) * 86400 < 60;

(The julianday math is SQLite-specific; PostgreSQL would use EXTRACT(EPOCH FROM ...). The point is time difference under 60 seconds.) If this query returns rows, the application's idempotency story is broken — clicking Submit twice creates two orders.

Finding orphans

An orphan is a child row whose parent doesn't exist. The classic anti-join pattern from Chapter 3 finds them:

-- Order items whose order has been deleted
SELECT oi.id, oi.order_id, oi.product_id, oi.quantity
FROM order_items oi
LEFT JOIN orders o ON oi.order_id = o.id
WHERE o.id IS NULL;

In a database with proper foreign keys and ON DELETE CASCADE, this returns nothing. If it returns rows, either the FK isn't enforced or someone bypassed it (a bulk-delete script, an admin tool with elevated privileges).

-- Orders whose user has been deleted
SELECT o.id, o.user_id, o.total, o.created_at
FROM orders o
LEFT JOIN users u ON o.user_id = u.id
WHERE u.id IS NULL;

Each row is a real bug, since the application probably can't render it (no user to attach to).

-- Products in non-existent categories
SELECT p.id, p.name, p.category_id
FROM products p
LEFT JOIN categories c ON p.category_id = c.id
WHERE c.id IS NULL;

Finding mismatched totals

When the same fact is stored in two places — orders.total and the sum of order_items — they can drift. A query that compares them surfaces every drift:

SELECT
    o.id,
    o.total                            AS recorded,
    SUM(oi.quantity * oi.price)        AS computed,
    o.total - SUM(oi.quantity * oi.price) AS diff
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id, o.total
HAVING o.total <> SUM(oi.quantity * oi.price);

In our seed data, this query is deliberately not empty — orders 2 and 4 have totals that don't match their line items. That's intentional, so you have something to find:

| id | recorded | computed | diff   |
|----|----------|----------|--------|
| 2  |    48.50 |    60.50 | -12.00 |
| 4  |    63.00 |    80.00 | -17.00 |

In a real codebase, every row in this result set is a bug — or, at minimum, a question to ask: "why don't these match?". Sometimes the answer is innocent (the order had a discount that's stored separately). More often it's a real calculation bug.

Finding missing required fields

A column that's NOT NULL in the schema can still be NULL in the data — if it was added later via a migration that didn't backfill, or if the constraint was applied loosely. The detector:

-- Users missing required fields
SELECT id, name, email FROM users WHERE name IS NULL OR email IS NULL;
 
-- Products missing a price
SELECT id, name FROM products WHERE price IS NULL;
 
-- Orders missing a user reference
SELECT id, total FROM orders WHERE user_id IS NULL;

Each row is "the schema says this should never happen, but it did." Real data tells you whether the schema's promises are kept.

Finding invalid foreign keys

Strictly, an enforced foreign key shouldn't let this happen. But foreign keys can be deferred, dropped during migrations, or bypassed by direct INSERTs from admin tools. A defensive sweep:

-- orders.user_id values that point at no user
SELECT DISTINCT user_id
FROM orders
WHERE user_id NOT IN (SELECT id FROM users WHERE id IS NOT NULL);

If your schema has FKs enforced, this should return zero. If it returns anything, the FK isn't doing what it claims to do.

Building a data-quality dashboard

Combine the queries into one sweep that runs nightly. Each query reports a "bad rows" count; ideally each count is zero:

-- "Bad rows" snapshot — one COUNT per check
SELECT 'duplicate emails'      AS check_name,
       COUNT(*)                AS bad_rows
FROM (SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1) d
UNION ALL
SELECT 'orphan order_items',
       COUNT(*)
FROM order_items oi
LEFT JOIN orders o ON oi.order_id = o.id
WHERE o.id IS NULL
UNION ALL
SELECT 'orders with NULL user_id',
       COUNT(*)
FROM orders WHERE user_id IS NULL
UNION ALL
SELECT 'orders whose total != items',
       COUNT(*)
FROM (
  SELECT o.id
  FROM orders o JOIN order_items oi ON oi.order_id = o.id
  GROUP BY o.id, o.total
  HAVING o.total <> SUM(oi.quantity * oi.price)
) d;

Sample result against our seed data:

| check_name                  | bad_rows |
|-----------------------------|----------|
| duplicate emails            | 0        |
| orphan order_items          | 0        |
| orders with NULL user_id    | 0        |
| orders whose total != items | 2        |

Three checks pass, one fails. The dashboard told the truth in 5 lines of SQL.

⚠️ Common Mistakes

Treating zero-row results as boring. Zero rows from a duplicate-detector is the good answer. The query running cleanly and returning empty is what success looks like — celebrate it.
Confusing schema-level and data-level integrity. A schema can declare UNIQUE; the data can still have duplicates if the constraint was added after the duplicates landed. Schema and data tell different stories — check both.
Letting "innocent" diffs hide real bugs. "Order total doesn't match items because there's a discount field" is a fine explanation — but only if the discount is captured somewhere. Always confirm the explanation; don't accept it on faith.

🎯 Practice Task

30 minutes. Use the e-commerce database. Run every query.

Run the duplicate-email detector. Confirm zero rows. Then INSERT a duplicate (email = 'alice@example.com') — does the schema's UNIQUE constraint reject it? If it does, your test of the detector has to be in a database without the constraint. Note your finding.
Run the orphan-order_items detector. Confirm zero rows. Then DELETE order id 1 (preview with SELECT first). Re-run the detector — you should now have rows. Re-INSERT the order (or ROLLBACK if you wrapped this in a transaction).
Run the orders-vs-items mismatch query against the seed data. You should get the two deliberately-inconsistent rows (orders 2 and 4). Read the result.
Write a query that finds users with no email set (email IS NULL). Even though our schema disallows it, write the query — this is exactly the sweep you'd run on a real database.
Compose your own UNION-ALL "data health" snapshot with at least 3 checks of your choosing. Run it.
Stretch: propose a sixth integrity check that's domain-specific — a rule the application should be enforcing — and turn it into a SELECT. Examples: "no order should have status='completed' with total=0", or "no product should have price=0 unless in_stock=FALSE." Imagine it as a release-gate query: zero rows means we ship.

Next lesson — the final one in this chapter — turns these integrity-detection queries into the kind of multi-table reports a PM actually opens and reads.