Bias and fairness testing
Bias testing in AI has been a mature sub-field for nearly a decade. The hard part is not running the metrics; it is choosing which metric. Demographic parity, equalised odds, calibration, and predictive parity are mathematically distinct, and Kleinberg, Mullainathan and Raghavan's 2016 impossibility result proved that they cannot all be satisfied simultaneously except in degenerate cases such as identical base rates across groups or a perfect predictor. The work is picking the right metric for your use case and accepting the trade-off explicitly.
The four metrics
Demographic parity, equalised odds, predictive parity, and treatment equality — each optimises for a different definition of fairness.
Four fairness metrics dominate practitioner use. Demographic parity (also called statistical parity) requires that the model produces positive outcomes at equal rates across demographic groups, regardless of base rates. Equalised odds requires that the true positive rate and false positive rate are equal across groups; it is the standard in contexts where both false positives and false negatives carry significant cost. Predictive parity requires that positive predictive value is equal across groups; its score-based counterpart, calibration, requires that among all individuals who receive a given score, the proportion who experience the predicted outcome is the same across groups. Treatment equality requires that the ratio of false negatives to false positives is equal across groups.
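To make the definitions concrete, the sketch below computes the per-group quantity behind each of the four metrics from hard binary predictions. It is an illustration only, not how Fairlearn or AIF360 implement these metrics: the function name `group_rates` and the toy data are assumptions, and predictive parity is measured here as positive predictive value because there are no scores to calibrate.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return, per group, the rates that each fairness metric compares."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    rates = {}
    for g, c in counts.items():
        n = sum(c.values())
        rates[g] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),  # demographic parity
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),       # equalised odds
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # equalised odds
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),       # predictive parity
            "fn_fp_ratio": ratio(c["fn"], c["fp"]),         # treatment equality
        }
    return rates


# Hypothetical toy data: two groups of four individuals each.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
for g, r in rates.items():
    print(g, r)
```

A metric is satisfied when its rate matches across groups. In this toy data the false positive rate matches, but the selection rate, true positive rate, positive predictive value, and false-negative to false-positive ratio all differ.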
The matrix below maps each metric against four deployment contexts. "Full" indicates the metric is typically the primary choice for that context; "partial" indicates it is a secondary consideration; "none" indicates it is generally not appropriate or informative for that context.
Kleinberg's impossibility result
You cannot simultaneously satisfy demographic parity, equalised odds, and calibration — this is a mathematical result, not an engineering gap waiting to be closed.
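Kleinberg, Mullainathan, and Raghavan's 2016 paper 'Inherent Trade-Offs in the Fair Determination of Risk Scores' proved that calibration within groups and balance of scores for the positive and negative classes (the risk-score counterpart of equalised odds) cannot all be satisfied simultaneously except when base rates are identical across groups or the model predicts perfectly, conditions that rarely hold in practice. Demographic parity is subject to a related tension: when base rates differ, equal selection rates generally cannot coexist with calibration or with equal error rates. This is a mathematical impossibility result. No better algorithm will satisfy all three metrics at once on real-world data with unequal base rates.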
The practitioner consequence is that arguments about "the unfair model" frequently dissolve into arguments about which fairness metric was used. A model optimised for demographic parity will necessarily violate calibration when base rates differ. A model optimised for equalised odds will necessarily violate demographic parity in most real datasets. Choosing the metric is choosing whose fairness definition to prioritise.
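The trade-off can be checked with a few lines of arithmetic. The sketch below is not the proof from the paper; it applies Bayes' rule to hypothetical numbers, holding true and false positive rates equal across two groups (so equalised odds holds) while the base rates differ.

```python
def selection_rate(tpr, fpr, base_rate):
    """Share of a group predicted positive, given error rates and base rate."""
    return tpr * base_rate + fpr * (1 - base_rate)


def ppv(tpr, fpr, base_rate):
    """Positive predictive value via Bayes' rule."""
    return tpr * base_rate / selection_rate(tpr, fpr, base_rate)


# Hypothetical values: identical error rates, different base rates.
TPR, FPR = 0.8, 0.2
base_rates = {"group_a": 0.5, "group_b": 0.2}

results = {
    g: {"selection_rate": selection_rate(TPR, FPR, p), "ppv": ppv(TPR, FPR, p)}
    for g, p in base_rates.items()
}
for g, r in results.items():
    print(g, round(r["selection_rate"], 3), round(r["ppv"], 3))
# prints:
# group_a 0.5 0.8
# group_b 0.32 0.5
```

With equal error rates, the selection rates (0.5 against 0.32) break demographic parity and the positive predictive values (0.8 against 0.5) break predictive parity. The only ways to close both gaps are to equalise the base rates or to change the error rates for one group.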
The correct response is not to treat this as a reason to avoid fairness testing. It is to choose one metric to optimise, document the trade-off explicitly, and defend the choice against stakeholders and regulators who will ask. "We chose equalised odds for the healthcare triage model because we judged false positives and false negatives to carry equal cost" is a defensible position. "We optimised for all fairness metrics simultaneously" is not.
Stable open-source toolkits
Fairlearn and AIF360 have both stayed stable through the LLM-era turbulence — the mathematics is older than the current vendor cycle.
Fairlearn is a community-stewarded library (Microsoft-seeded, now independently governed) for fairness assessment and mitigation in tabular ML pipelines. It integrates with scikit-learn, exposes MetricFrame for disaggregated metrics across sensitive groups, and includes mitigation algorithms (reductions and post-processing approaches). Best fit: tabular classification and regression models where scikit-learn is already in the stack.
AIF360 — at v0.6.1 on aif360.readthedocs.io — is the IBM-seeded fairness toolkit with broader metric coverage than Fairlearn and a steeper learning curve. A note on infrastructure: the older aif360.res.ibm.com URL has certificate issues; always link to the readthedocs documentation, never the old IBM URL. Best fit: research contexts and cases requiring more nuanced metric coverage than Fairlearn provides.
Both toolkits have stayed largely stable through the turbulence of the 2024–2026 AI vendor cycle. Fairness mathematics is not moving as fast as benchmark leaderboards or eval platform positioning. When the ecosystem around you is churning, stable tooling with a long maintenance record is worth its weight in gold.
```python
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    equalized_odds_difference,
)
from sklearn.metrics import accuracy_score

# Illustrative toy inputs; replace with your own data.
# y_true: ground-truth labels
# y_pred: model predictions
# sensitive_features: group membership (e.g. "gender", "race")
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
sensitive_features = ["a", "a", "a", "a", "b", "b", "b", "b"]

mf = MetricFrame(
    metrics={"accuracy": accuracy_score},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive_features,
)
print(mf.by_group)      # accuracy per group
print(mf.difference())  # largest accuracy gap between groups

dp = demographic_parity_difference(
    y_true, y_pred, sensitive_features=sensitive_features
)
eo = equalized_odds_difference(
    y_true, y_pred, sensitive_features=sensitive_features
)
print(f"DP difference: {dp:.4f}")
print(f"EO difference: {eo:.4f}")
```
LLM-era complications
Classification metrics need adaptation for generative models — per-group quality measurement and refusal-rate analysis are the current best practice.
Generative models do not fit the classification-metric frame cleanly. There are no discrete predicted labels to compare against ground-truth labels; outputs are free-form text. Applying demographic parity or equalised odds directly requires reducing the output to a binary label — harmful / not harmful, correct / not correct — which discards most of the information about output quality.
Three emerging measurement approaches are gaining traction for generative AI fairness. Per-group quality measurement asks: does the model produce outputs of equal quality when writing about, or responding to users from, different demographic groups? Refusal-rate analysis asks: does the model refuse requests at different rates for different groups? Representation analysis asks: does the model's generated text contain stereotyped or under-represented portrayals of particular groups?
These approaches are less standardised than the classical classification metrics and have less tooling support. Fairlearn and AIF360 do not natively support them; per-group quality measurement typically requires a custom evaluation pipeline using one of the frameworks covered in the eval-platforms-and-tooling guide. Tooling consolidation is expected over 2026–2027.
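A custom pipeline of this kind can start very small. The sketch below assumes you already have one evaluation record per prompt, tagged with the demographic group the prompt concerns, a refusal flag, and a quality score from your own grader. The field names, the 0 to 1 score scale, and the data are hypothetical and do not come from any particular eval framework.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records; "quality" is None when the model refused.
records = [
    {"group": "a", "refused": False, "quality": 0.9},
    {"group": "a", "refused": False, "quality": 0.8},
    {"group": "a", "refused": True, "quality": None},
    {"group": "a", "refused": False, "quality": 0.7},
    {"group": "b", "refused": True, "quality": None},
    {"group": "b", "refused": True, "quality": None},
    {"group": "b", "refused": False, "quality": 0.6},
    {"group": "b", "refused": False, "quality": 0.8},
]


def per_group_summary(rows):
    """Refusal rate and mean quality of answered prompts, per group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row)
    summary = {}
    for group, items in by_group.items():
        answered = [r["quality"] for r in items if not r["refused"]]
        summary[group] = {
            "n": len(items),
            "refusal_rate": sum(r["refused"] for r in items) / len(items),
            "mean_quality": mean(answered) if answered else None,
        }
    return summary


summary = per_group_summary(records)
refusal_rates = [s["refusal_rate"] for s in summary.values()]
refusal_gap = max(refusal_rates) - min(refusal_rates)

for group, stats in summary.items():
    print(group, stats)
print("refusal-rate gap:", refusal_gap)
```

Reporting the gap alongside each group's sample size matters: with only a handful of prompts per group, a large gap may be noise rather than evidence of disparate treatment.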
The dataset side of bias
Bias originates in training data more often than in model architecture — dataset audits are upstream of any model-side mitigation.
Bias rarely originates in the model architecture; it originates in the training data. Under-represented groups produce less training signal; label quality varies by group when human annotators bring consistent biases; historical outcomes encoded in datasets reflect historical injustice, which the model learns as signal rather than artefact.
Auditing the dataset for under-represented groups, label-quality variance by group, and historically-biased outcome labels is upstream of any model-side mitigation. The same principles that apply to PII handling in synthetic test data apply here: every eval dataset is a data-capture event, and representational gaps compound over evaluation cycles.
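A first-pass audit of the first and third checks above can be sketched in a few lines. The example assumes the dataset reduces to (group, label) pairs. The 15% representation threshold is an arbitrary placeholder, not a recommended value, and the data is hypothetical. Label-quality variance is not covered here, because it needs multiple annotations per example.

```python
from collections import Counter

# Hypothetical labelled examples as (group, label) pairs.
examples = [
    ("a", 1), ("a", 0), ("a", 1), ("a", 1), ("a", 0), ("a", 1),
    ("b", 0), ("b", 0), ("b", 1),
    ("c", 0),
]


def audit(rows, min_share=0.15):
    """Report each group's share of the data and its positive-label rate."""
    totals = Counter(g for g, _ in rows)
    positives = Counter(g for g, y in rows if y == 1)
    n = len(rows)
    report = {}
    for g, count in totals.items():
        share = count / n
        report[g] = {
            "count": count,
            "share": share,
            "positive_rate": positives[g] / count,
            "under_represented": share < min_share,  # placeholder threshold
        }
    return report


report = audit(examples)
for g, stats in report.items():
    print(g, stats)
```

Large differences in `positive_rate` between groups do not prove the labels are biased, but they mark where to ask whether the outcome labels encode historical decisions rather than ground truth.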
Model-side fairness fixes — reweighting, adversarial de-biasing, post-hoc threshold adjustment — are downstream patches for upstream data problems. They are still worth applying; the point is that data investment compounds over time and model-side patches do not.
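As an illustration of what reweighting does, the sketch below uses one common formulation, often called reweighing: each (group, label) cell gets a weight equal to the frequency expected if group and label were independent, divided by the observed frequency. The function name and data are hypothetical, and library implementations may differ in detail.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight per (group, label) cell: expected count / observed count."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    cell_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * cell_counts[(g, y)])
        for (g, y) in cell_counts
    }


# Hypothetical data: group "a" is labelled positive far more often than "b".
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for cell, w in sorted(weights.items()):
    print(cell, round(w, 3))
```

Cells that are rarer than independence would predict, such as positive labels in group "b", are weighted up, so the weighted positive rate becomes equal across groups. The weights would then be passed as per-sample weights to a training routine. The underlying data is unchanged, which is why this remains a patch rather than a fix.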