AI for data quality validation
Using AI-generated test data without quality validation is rolling the dice. Schema conformance, statistical realism, and referential integrity each fail differently and require different detection. The tools have matured fast. GX (formerly Great Expectations), with 20M+ monthly downloads, has become the de facto open-source standard for schema contracts. LLM-based anomaly detection has emerged as a useful complement, not a replacement. Knowing which tools to combine for which use case is the design decision that determines whether your data quality pipeline is a safety net or a false sense of security.
Four quality dimensions
Schema conformance, distribution realism, referential integrity, and anomaly detection — each fails differently.
Test data quality fails in four distinct ways, each requiring a different detection mechanism. Schema conformance checks whether each row matches the declared contract — column types, nullability, format constraints, value ranges. This is the most common quality check and the one most tooling handles well.
Distribution realism checks whether the column-level statistics of your synthetic data still resemble production: if the age distribution in your user fixture has drifted significantly from the production baseline, tests that depend on representative data will produce misleading results. This check requires a baseline snapshot from production and a comparison threshold.
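A minimal sketch of such a baseline comparison, assuming a categorical column and using total variation distance as the drift measure. The function names, the sample fixture, and the 0.10 threshold are illustrative assumptions rather than values from any particular tool; numeric columns would need a different measure.

```python
from collections import Counter


def proportions(values):
    """Return the share of each distinct value in a column."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(baseline, observed):
    """Half the sum of absolute differences between two categorical distributions."""
    keys = set(baseline) | set(observed)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - observed.get(k, 0.0)) for k in keys)


# Hypothetical production baseline and a synthetic fixture column.
baseline_tier = {"bronze": 0.6, "silver": 0.3, "gold": 0.1}
synthetic_tier = ["bronze"] * 4 + ["silver"] * 3 + ["gold"] * 3

DRIFT_THRESHOLD = 0.10  # assumed threshold; tune per column

drift = total_variation_distance(baseline_tier, proportions(synthetic_tier))
drift_exceeds_threshold = drift > DRIFT_THRESHOLD
```

Here the synthetic fixture over-represents gold users, so the drift lands well above the assumed threshold and the check would flag the column.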
Referential integrity checks whether foreign keys in synthetic data point to records that actually exist in the same dataset — a common failure mode when synthetic tables are generated independently rather than as a joined set. A user fixture with order records that reference non-existent user IDs produces test failures that look like application bugs, not data bugs.
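The orphaned-key failure described above can be detected with a simple set lookup. This is a sketch under assumptions: the table shapes, the column names (`id`, `user_id`), and the helper name `find_orphaned_rows` are hypothetical.

```python
def find_orphaned_rows(child_rows, parent_rows, foreign_key, parent_key="id"):
    """Return child rows whose foreign key matches no parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Hypothetical fixtures: two tables generated independently of each other.
users = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 3},  # references a user that does not exist
]

orphans = find_orphaned_rows(orders, users, foreign_key="user_id")
```

Running a check like this right after generation attributes the failure to the fixture rather than to the application under test.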
Anomaly detection checks for anything statistically unusual that none of the above checks would catch: a column where ninety percent of values are identical (low entropy), a date column where all values cluster in a single week, or a numeric column where one row has a value several standard deviations from the rest. LLM-based anomaly detection is most useful here, because it can surface patterns that deterministic rules would miss.
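Two of the simpler cases named above, a dominant value and a far-out numeric outlier, can be sketched as cheap heuristics that run before any LLM pass. The function names, the 0.9 share cut-off, and the 3.0 standard-deviation threshold are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def dominant_value_share(values):
    """Share of rows taken by the most common value (1.0 means a constant column)."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def zscore_outliers(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # constant column: no row stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical fixture columns.
tiers = ["bronze"] * 19 + ["gold"]
ages = [40] * 19 + [400]

low_entropy = dominant_value_share(tiers) >= 0.9
outlier_rows = zscore_outliers(ages)
```

Heuristics like these only catch patterns someone thought to encode, which is why an LLM review remains a useful complement for everything else.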
The validation flow
Schema check first, distribution diff second, anomaly detection third — hard block on failure, advisory on anomaly.
The validation pipeline applies checks in order of certainty. Schema conformance is deterministic: a row either matches the contract or it does not. A failure here is a hard CI block, and the fixture cannot be used until the schema violation is resolved. Distribution and referential integrity checks fail against configurable thresholds: small deviations are informational, large deviations block.
Anomaly detection runs last and is advisory: it ranks rows by suspicion and surfaces them for review rather than blocking. Blocking on an LLM anomaly flag would produce too many false positives; surfacing anomalies for human review produces signal that would otherwise require manual inspection of thousands of rows to find.
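The ordering above can be sketched as a small orchestration function. Everything here is an assumption made for illustration: the `ValidationReport` shape, the `warn_at` and `block_at` thresholds, and the choice to return early on schema failure. The inputs are expected to come from whatever schema, drift, and anomaly checks the pipeline already runs.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationReport:
    blocked: bool = False
    failures: list = field(default_factory=list)    # hard failures that block CI
    warnings: list = field(default_factory=list)    # informational deviations
    advisories: list = field(default_factory=list)  # anomaly suggestions for review


def run_validation(schema_errors, drift_by_column, anomaly_flags,
                   warn_at=0.05, block_at=0.20):
    report = ValidationReport()

    # 1. Schema conformance: deterministic, so any violation blocks.
    if schema_errors:
        report.blocked = True
        report.failures.extend(schema_errors)
        return report  # assumed: skip later checks on malformed rows

    # 2. Thresholded checks: small deviations warn, large deviations block.
    for column, drift in drift_by_column.items():
        if drift >= block_at:
            report.blocked = True
            report.failures.append(f"{column}: drift {drift:.2f}")
        elif drift >= warn_at:
            report.warnings.append(f"{column}: drift {drift:.2f}")

    # 3. Anomaly detection: advisory only, never changes `blocked`.
    report.advisories.extend(anomaly_flags)
    return report


report = run_validation(
    schema_errors=[],
    drift_by_column={"age": 0.08, "tier": 0.25},
    anomaly_flags=["row 17: age far from baseline"],
)
```

In this example the `tier` drift blocks, the `age` drift is only a warning, and the anomaly flag is carried along for human review without affecting the outcome.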
Vendor landscape, May 2026
GX is the open-source standard; Soda offers a YAML-first alternative; Monte Carlo adds observability at scale.
The data-quality tooling landscape has consolidated around a few clear patterns. GX (formerly Great Expectations) has 20M+ monthly downloads, making it the most widely adopted open-source data quality library; GX Cloud is the commercial product layered above the open-source core. Soda Core and Monte Carlo address different points on the spectrum from developer-owned to platform-managed quality.
| Tool | Description | Open-source | AI features | Best fit |
|---|---|---|---|---|
| GX (formerly Great Expectations) | GX Core open-source library (20M+ monthly downloads, 13k+ community); GX Cloud commercial product | Yes (GX Core) | GX Cloud uses LLM for expectation suggestion from data samples | Teams wanting the most-adopted open-source standard with optional commercial uplift |
| Soda Core | Open-source SodaCL declarative DSL for data quality; YAML-driven checks | Yes (Soda Core) | Soda Cloud adds ML-driven anomaly detection and alerting | Teams who prefer YAML over Python for check definition |
| Monte Carlo | Commercial data observability platform; ML-driven anomaly detection across full data pipelines | No | Core to the product — anomaly detection is the primary value proposition | Large data organisations wanting end-to-end pipeline observability and alerting |
| LLM-as-validator (custom) | Direct prompt: "does this dataset look anomalous given this baseline?" — advisory only | N/A | The entire tool | Exploratory anomaly checks; never as the only validation layer |
Data-quality tools, May 2026
LLM-as-validator — the myth
Pattern matching is not formal verification — use LLMs for anomaly suggestion, schema contracts for hard validation.
The appeal of LLM-as-validator is real: describe your dataset and ask whether it looks anomalous. The result often contains genuinely useful signal. The problem is what the LLM cannot do: provide a formal guarantee. A model that says "this row looks reasonable" has applied pattern matching against its training distribution, not a deterministic contract check. Its false-negative rate stays unknown unless you measure it through systematic evaluation.
The appropriate role for LLM validation is anomaly suggestion, not pass/fail determination. Run GX or Soda first — those checks are deterministic and provide the hard quality floor. Run an LLM pass second, to surface patterns the schema contract would not catch: unusual clustering, low-entropy columns, outliers that are technically schema-valid but statistically implausible.
A practical stack
GX Core for the floor, LLM anomaly detection for the ceiling — Monte Carlo or Soda Cloud if budget allows.
The practical starting point for a QA team building a data quality pipeline requires no commercial tooling. GX Core (open-source) handles schema conformance and expectations as code — deterministic, version-controlled, fails CI hard on contract violation. An LLM anomaly-detection prompt run as a post-generation advisory check surfaces statistical anomalies that the expectation suite would not catch. Together they cover the floor and the ceiling of data quality validation.
For teams with budget and scale: Soda Cloud adds managed anomaly detection without requiring LLM prompt engineering; Monte Carlo adds full pipeline observability across every upstream source. Both are worthwhile at large data-engineering scale. For a QA team running a test fixture pipeline, GX Core plus an LLM advisory pass is the most cost-effective starting point.
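# GX (formerly Great Expectations): schema + expectations as code
# Note: this is the validator-style workflow; method names vary between GX
# releases, so check them against your installed version.
import great_expectations as gx

context = gx.get_context()
suite = context.add_expectation_suite("synthetic_users_v1")

# batch_request is not defined in this excerpt; build it from the datasource
# that holds the synthetic fixture.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="synthetic_users_v1",
)

# Schema contract: required column, nullability, format, ranges, allowed values
validator.expect_column_to_exist("email")
validator.expect_column_values_to_not_be_null("email")
validator.expect_column_values_to_match_regex(
    "email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
)
validator.expect_column_values_to_be_between("age", min_value=18, max_value=75)
validator.expect_column_value_lengths_to_be_between("postcode", min_value=5, max_value=8)
validator.expect_column_values_to_be_in_set("tier", ["bronze", "silver", "gold"])

# Run every expectation registered above against the batch
results = validator.validate()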
assert results.success, f"Data quality check failed: {results}"

# LLM anomaly detection: advisory only, not a hard CI block
# Run after GX validation passes; surfaces statistical anomalies
You are reviewing a synthetic test dataset for anomalies.
The dataset is a JSON array of user records (sample shown below).

Baseline statistics (from last production snapshot):
- Age: median 42, range 18–75, roughly normal distribution
- Tier: 60% bronze, 30% silver, 10% gold
- Postcodes: mix of UK formats, realistic regional clustering

Review the records below and identify:
1. Any row that is statistically anomalous relative to the baseline
2. Any column showing unexpected value clustering or low entropy
3. Any field that appears to have been repeated across rows

Respond with: anomaly flag (yes/no), list of suspicious row indices, one-sentence explanation per flagged row. No other output.

Dataset: [paste rows here]
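Rather than pasting rows by hand, the prompt can be assembled programmatically. The sketch below is one possible approach, not a prescribed one: `build_anomaly_prompt` and its `max_rows` cap are hypothetical, the prompt wording is abbreviated from the text above, and the call to an actual model is deliberately left out because it depends on the provider you use.

```python
import json


def build_anomaly_prompt(baseline_lines, rows, max_rows=50):
    """Assemble the advisory prompt from baseline statistics and a row sample."""
    sample = rows[:max_rows]  # cap the sample so the prompt stays small
    baseline = "\n".join(f"- {line}" for line in baseline_lines)
    return (
        "You are reviewing a synthetic test dataset for anomalies.\n"
        "Baseline statistics (from last production snapshot):\n"
        f"{baseline}\n"
        "Respond with: anomaly flag (yes/no), list of suspicious row indices, "
        "one-sentence explanation per flagged row. No other output.\n"
        "Dataset:\n"
        f"{json.dumps(sample, indent=2)}"
    )


# Hypothetical fixture rows.
rows = [{"age": 31, "tier": "bronze"}, {"age": 400, "tier": "gold"}]
prompt = build_anomaly_prompt(
    ["Age: median 42, range 18-75", "Tier: 60% bronze, 30% silver, 10% gold"],
    rows,
)
```

Whatever the model returns should be logged or attached to the CI run for review, not used as a pass/fail signal.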
Schema contracts are the floor; LLM anomaly detection is the ceiling. Building a quality pipeline without both is leaving signal on the table.