AI for data quality validation

9 min read · Reviewed May 2026 · validation

AI-generated test data without quality validation is rolling the dice. Schema conformance, statistical realism, referential integrity, and anomaly detection each fail differently and require different detection. The tools have matured fast. GX (formerly Great Expectations), with its 20M+ monthly downloads, has become the de facto open-source standard for schema contracts. LLM-based anomaly detection has emerged as a useful complement, not a replacement. Knowing which to combine for which use case is the design decision that determines whether your data quality pipeline is a safety net or a false sense of security.

Difficulty: intermediate
You'll learn: which data-quality tools fit which use case, and why LLM-as-validator is a complement to schema contracts, not a replacement.

Four quality dimensions

Schema conformance, distribution realism, referential integrity, and anomaly detection — each fails differently.

Test data quality fails in four distinct ways, each requiring a different detection mechanism. Schema conformance checks whether each row matches the declared contract — column types, nullability, format constraints, value ranges. This is the most common quality check and the one most tooling handles well.

Distribution realism checks whether the column-level statistics of your synthetic data still resemble production: if the age distribution in your user fixture has drifted significantly from the production baseline, tests that depend on representative data will produce misleading results. This check requires a baseline snapshot from production and a comparison threshold.
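One way to make "drifted significantly" measurable is a two-sample Kolmogorov-Smirnov statistic, which reports the largest gap between the baseline and fixture cumulative distributions. The sketch below is a minimal standard-library version; the function names, the sample ages, and the 0.2 threshold are illustrative assumptions, not values from any particular tool.

```python
from bisect import bisect_right


def ks_statistic(baseline, sample):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions (0 = identical)."""
    base_sorted = sorted(baseline)
    samp_sorted = sorted(sample)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for value in set(base_sorted) | set(samp_sorted):
        base_cdf = bisect_right(base_sorted, value) / len(base_sorted)
        samp_cdf = bisect_right(samp_sorted, value) / len(samp_sorted)
        max_gap = max(max_gap, abs(base_cdf - samp_cdf))
    return max_gap


def check_distribution(baseline, sample, threshold=0.2):
    """Return (passed, statistic). The 0.2 threshold is an illustrative default."""
    stat = ks_statistic(baseline, sample)
    return stat <= threshold, stat


# Hypothetical data: a production baseline snapshot and two candidate fixtures.
production_ages = [22, 29, 35, 38, 42, 42, 47, 51, 58, 66]
good_fixture = [24, 30, 33, 40, 41, 44, 49, 52, 60, 64]
skewed_fixture = [18, 18, 19, 19, 20, 20, 21, 21, 22, 22]

print(check_distribution(production_ages, good_fixture))
print(check_distribution(production_ages, skewed_fixture))
```

The first fixture tracks the baseline closely and passes; the second packs every age into the 18 to 22 range and fails. In practice the threshold would be tuned per column against how much drift the dependent tests can tolerate.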

Referential integrity checks whether foreign keys in synthetic data point to records that actually exist in the same dataset — a common failure mode when synthetic tables are generated independently rather than as a joined set. A user fixture with order records that reference non-existent user IDs produces test failures that look like application bugs, not data bugs.
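The orphaned-key failure described above can be caught with a set lookup before any test runs. This is a minimal sketch assuming fixtures are lists of dictionaries; the `find_orphans` helper and the field names `id`, `user_id`, and `order_id` are hypothetical.

```python
def find_orphans(child_rows, fk_field, parent_rows, pk_field="id"):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk_field] for row in parent_rows}
    return [row for row in child_rows if row[fk_field] not in parent_keys]


# Hypothetical fixtures generated independently of each other.
users = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": "A-100", "user_id": 1},
    {"order_id": "A-101", "user_id": 3},
    {"order_id": "A-102", "user_id": 7},  # no such user: orphaned order
]

orphans = find_orphans(orders, "user_id", users)
print(orphans)
```

Reporting the orphaned rows themselves, rather than a bare pass/fail, makes it obvious that the defect is in the data and not in the application under test.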

Anomaly detection checks for anything statistically unusual that none of the above checks would catch: a column where ninety percent of values are identical (low entropy), a date column where all values cluster in a single week, or a numeric column where one row has a value several standard deviations from the rest. LLM-based anomaly detection is most useful here because it can surface patterns that nobody thought to write a deterministic rule for.
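Two of the symptoms named above, a dominant repeated value and a far-out numeric outlier, can be pre-screened with plain statistics before any LLM sees the data. The sketch below uses only the standard library; the helper names, sample columns, and the limit of three standard deviations are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def dominant_share(values):
    """Fraction of rows taken up by the single most common value."""
    if not values:
        return 0.0
    return Counter(values).most_common(1)[0][1] / len(values)


def zscore_outliers(values, limit=3.0):
    """Indices of values more than `limit` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every value identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > limit]


# Hypothetical columns from a generated fixture.
tiers = ["bronze"] * 18 + ["silver", "gold"]
basket_totals = [20.0] * 19 + [5000.0]

print(dominant_share(tiers))
print(zscore_outliers(basket_totals))
```

One caveat with z-scores: a single extreme value inflates the standard deviation it is measured against, so in small samples a looser limit or a median-based measure may be a better fit.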

The validation flow

Schema check first, distribution diff second, anomaly detection third — hard block on failure, advisory on anomaly.

The validation pipeline applies checks in order of certainty. Schema conformance is deterministic — a row either matches the contract or it does not. A failure here is a hard CI block: the fixture cannot be used until the schema violation is resolved. Distribution and referential integrity checks are configurable-threshold failures: small deviations are informational, large deviations block.

Anomaly detection runs last and is advisory: it ranks rows by suspicion and surfaces them for review rather than blocking. Blocking on an LLM anomaly flag would produce too many false positives; surfacing anomalies for human review produces signal that would otherwise require manual inspection of thousands of rows to find.
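The ordering described above can be sketched as a small gate function: a schema failure blocks immediately, drift blocks only past a configurable threshold, and anomaly flags are only ever queued for review. The `GateResult` type, the `run_gate` function, and both threshold values are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    blocked: bool
    reasons: list = field(default_factory=list)     # why CI was blocked
    advisories: list = field(default_factory=list)  # for human review only


def run_gate(schema_ok, drift, anomaly_rows, warn_at=0.1, block_at=0.3):
    """Apply checks in order of certainty; thresholds are illustrative."""
    result = GateResult(blocked=False)
    if not schema_ok:
        # Deterministic failure: stop here, later checks would be noise.
        result.blocked = True
        result.reasons.append("schema contract violated")
        return result
    if drift > block_at:
        result.blocked = True
        result.reasons.append(f"distribution drift {drift:.2f} exceeds {block_at}")
    elif drift > warn_at:
        result.advisories.append(f"distribution drift {drift:.2f} above {warn_at}")
    # Anomaly flags never block; they are only queued for review.
    if anomaly_rows:
        result.advisories.append(f"review suspicious rows: {sorted(anomaly_rows)}")
    return result


print(run_gate(schema_ok=True, drift=0.15, anomaly_rows=[4, 2]))
print(run_gate(schema_ok=False, drift=0.0, anomaly_rows=[]))
```

Keeping blocking reasons and advisories in separate lists means the CI job can exit non-zero on the first while posting the second as a comment or report.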

Flow diagram: Generated dataset (fresh fixture) → Schema check → Diff vs baseline → Alert + block CI (or pass + tag)
Data-quality validation in CI

Vendor landscape, May 2026

GX is the open-source standard; Soda offers a YAML-first alternative; Monte Carlo adds observability at scale.

The data-quality tooling landscape has consolidated around a few clear patterns. GX (formerly Great Expectations) has 20M+ monthly downloads, making it the most widely adopted open-source data quality library; GX Cloud is the commercial product layered above the open-source core. Soda Core and Monte Carlo address different points on the spectrum from developer-owned to platform-managed quality.

Tool | Description | Open-source | AI features | Best fit
GX (formerly Great Expectations) | GX Core open-source library (20M+ monthly downloads, 13k+ community); GX Cloud commercial product | Yes (GX Core) | GX Cloud uses LLM for expectation suggestion from data samples | Teams wanting the most-adopted open-source standard with optional commercial uplift
Soda Core | Open-source SodaCL declarative DSL for data quality; YAML-driven checks | Yes (Soda Core) | Soda Cloud adds ML-driven anomaly detection and alerting | Teams who prefer YAML over Python for check definition
Monte Carlo | Commercial data observability platform; ML-driven anomaly detection across full data pipelines | No | Core to the product: anomaly detection is the primary value proposition | Large data organisations wanting end-to-end pipeline observability and alerting
LLM-as-validator (custom) | Direct prompt: "does this dataset look anomalous given this baseline?" (advisory only) | N/A | The entire tool | Exploratory anomaly checks; never as the only validation layer

Data-quality tools, May 2026

LLM-as-validator — the myth

Pattern matching is not formal verification — use LLMs for anomaly suggestion, schema contracts for hard validation.

The appeal of LLM-as-validator is real: describe your dataset and ask whether it looks anomalous. The result often contains genuinely useful signal. The problem is what the LLM cannot do: provide a formal guarantee. A model that says "this row looks reasonable" has applied pattern matching against its training distribution, not a deterministic contract check. The false-negative rate is unknown and unknowable without systematic evaluation.
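One form that systematic evaluation could take is seeding: corrupt a known set of rows on purpose, run the LLM pass, and count how many planted anomalies it missed. The sketch below only does the arithmetic; the function name and the row indices are hypothetical, and a real evaluation would repeat this over many seeded datasets.

```python
def false_negative_rate(seeded_indices, flagged_indices):
    """Share of deliberately planted anomalies that the reviewer missed."""
    seeded = set(seeded_indices)
    if not seeded:
        raise ValueError("seed at least one anomaly before measuring")
    missed = seeded - set(flagged_indices)
    return len(missed) / len(seeded)


# Hypothetical run: rows 3, 17, 42 and 88 were corrupted on purpose,
# and the LLM pass flagged rows 3, 42 and 60.
rate = false_negative_rate([3, 17, 42, 88], [3, 42, 60])
print(rate)
```

Row 60 in this example is a false positive and does not affect the rate; tracking it separately shows how much review time the advisory layer costs.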

The appropriate role for LLM validation is anomaly suggestion, not pass/fail determination. Run GX or Soda first — those checks are deterministic and provide the hard quality floor. Run an LLM pass second, to surface patterns the schema contract would not catch: unusual clustering, low-entropy columns, outliers that are technically schema-valid but statistically implausible.

// MYTH

LLM-as-validator isn't a replacement for schema contracts. A model that says 'this row looks reasonable' has zero formal guarantee — it's pattern matching against its training distribution. Use LLMs for anomaly suggestion (where to look), use schema contracts (GX / SodaCL) for hard validation (pass/fail). They're complements, not substitutes.

A practical stack

GX Core for the floor, LLM anomaly detection for the ceiling — Monte Carlo or Soda Cloud if budget allows.

The practical starting point for a QA team building a data quality pipeline requires no commercial tooling. GX Core (open-source) handles schema conformance and expectations as code — deterministic, version-controlled, fails CI hard on contract violation. An LLM anomaly-detection prompt run as a post-generation advisory check surfaces statistical anomalies that the expectation suite would not catch. Together they cover the floor and the ceiling of data quality validation.

For teams with budget and scale: Soda Cloud adds managed anomaly detection without requiring LLM prompt engineering; Monte Carlo adds full pipeline observability across every upstream source. Both are worthwhile at large data-engineering scale. For a QA team running a test fixture pipeline, GX Core plus an LLM advisory pass is the most cost-effective starting point.

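# GX (formerly Great Expectations): schema + expectations as code.
# Uses the legacy Validator workflow from pre-1.0 releases; GX Core 1.x
# reorganised much of this API, so pin the version or adapt the calls.
import great_expectations as gx

context = gx.get_context()
suite = context.add_expectation_suite("synthetic_users_v1")

# batch_request is not defined in this snippet: build it from the datasource
# and data asset that hold the generated fixture in your own project.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="synthetic_users_v1",
)

# Column contract: presence, nullability, format, range, length, allowed values.
validator.expect_column_to_exist("email")
validator.expect_column_values_to_not_be_null("email")
validator.expect_column_values_to_match_regex(
    "email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
)
validator.expect_column_values_to_be_between("age", min_value=18, max_value=75)
validator.expect_column_value_lengths_to_be_between("postcode", min_value=5, max_value=8)
validator.expect_column_values_to_be_in_set("tier", ["bronze", "silver", "gold"])

results = validator.validate()
# Exit explicitly rather than assert, so the gate still fires under `python -O`.
if not results.success:
    raise SystemExit(f"Data quality check failed: {results}")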
GX Core expectation suite — schema contract as code, fails CI on violation
# LLM anomaly detection — advisory only, not a hard CI block
# Run after GX validation passes; surfaces statistical anomalies

You are reviewing a synthetic test dataset for anomalies.
The dataset is a JSON array of user records (sample shown below).

Baseline statistics (from last production snapshot):
- Age: median 42, range 18–75, roughly normal distribution
- Tier: 60% bronze, 30% silver, 10% gold
- Postcodes: mix of UK formats, realistic regional clustering

Review the records below and identify:
1. Any row that is statistically anomalous relative to the baseline
2. Any column showing unexpected value clustering or low entropy
3. Any field that appears to have been repeated across rows

Respond with: anomaly flag (yes/no), list of suspicious row indices,
one-sentence explanation per flagged row. No other output.

Dataset: [paste rows here]
LLM anomaly-detection prompt — advisory layer, runs after GX schema check passes
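To run this prompt from a pipeline rather than by pasting rows, it can be assembled in code. The sketch below only builds the prompt text; the `build_anomaly_prompt` name, the `max_rows` cap, and the sample rows are hypothetical, and the call to whichever model you use is deliberately left out.

```python
import json


def build_anomaly_prompt(rows, baseline_lines, max_rows=200):
    """Assemble the advisory prompt from baseline bullet points and fixture rows."""
    # Cap the number of rows so the prompt stays bounded. This takes the first
    # max_rows rows; a random sample may be more representative.
    sample = rows[:max_rows]
    baseline = "\n".join(f"- {line}" for line in baseline_lines)
    return "\n".join([
        "You are reviewing a synthetic test dataset for anomalies.",
        "The dataset is a JSON array of user records (sample shown below).",
        "",
        "Baseline statistics (from last production snapshot):",
        baseline,
        "",
        "Review the records below and identify:",
        "1. Any row that is statistically anomalous relative to the baseline",
        "2. Any column showing unexpected value clustering or low entropy",
        "3. Any field that appears to have been repeated across rows",
        "",
        "Respond with: anomaly flag (yes/no), list of suspicious row indices,",
        "one-sentence explanation per flagged row. No other output.",
        "",
        "Dataset: " + json.dumps(sample),
    ])


prompt = build_anomaly_prompt(
    rows=[{"age": 42, "tier": "bronze"}, {"age": 42, "tier": "bronze"}],
    baseline_lines=["Age: median 42, range 18-75", "Tier: 60% bronze, 30% silver, 10% gold"],
)
print(prompt)
```

Generating the baseline lines from the same production snapshot used for the distribution check keeps the two layers comparing against one source of truth.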

Schema contracts are the floor; LLM anomaly detection is the ceiling. Building a quality pipeline without both is leaving signal on the table.
