ReferenceAdvanced5-7 min reference
ETL Testing
ETL (Extract, Transform, Load) pipelines move and reshape data between systems — and silently corrupt reports when a transform is wrong. Unlike a one-off migration, pipelines run repeatedly, so tests must be repeatable too. This sheet covers what to check at each stage; it's close kin to Data Migration Testing (linked below).
Test at each stage
| Stage | Verify |
|---|---|
| Extract | All expected source rows pulled; correct filters/incremental window; source unchanged |
| Transform | Business rules applied correctly; joins, aggregations, lookups, dedup |
| Load | Target row counts; upserts vs inserts; no duplicates; constraints hold |
Data-quality dimensions
| Dimension | Asks |
|---|---|
| Completeness | All rows/fields present? |
| Accuracy | Values match the rule/source of truth? |
| Consistency | Same value across tables/feeds? |
| Uniqueness | No unintended duplicates? |
| Validity | Types, ranges, formats correct? |
| Timeliness | Data fresh / within SLA? |
Pipeline-specific checks
- Idempotency / re-runs: running twice doesn't double-load.
- Incremental loads: only new/changed rows; watermark correct.
- Late / out-of-order data handled.
- Failure & restart: a mid-run failure recovers without partial corruption.
- Schema drift: a new/renamed source column is detected, not silently dropped.
Common mistakes
- Testing only the final table, never the transform logic that produced it.
- Comparing averages instead of row-level reconciliation (masks offsetting errors).
- Ignoring incremental-load and re-run idempotency (double counting).
- No alerting on data-quality failures — bad data reaches dashboards unnoticed.
- Tiny test data that never exercises late/duplicate/null edge cases.
// Related resources