Audit trails, model cards, and datasheets

9 min read · Reviewed May 2026

AI compliance documentation is a refresh problem more than a writing problem. Writing a model card once is easy; keeping it current as the model retrains weekly is the actual challenge. The three main conventions — model cards, datasheets for datasets, and system cards — overlap meaningfully but were each designed for a different audience. Knowing which to produce when saves time you would otherwise spend producing all three.

DIFFICULTY: intermediate

YOU'LL LEARN: Three documentation conventions (model cards, datasheets for datasets, and system cards), what each one is for, and the refresh discipline that keeps them from going stale.

The retraining loop

Documentation refresh on every training event — diff-only human review is the only cadence that stays current.

Modern AI systems retrain frequently — weekly or nightly in production contexts. Each training event creates a new model artefact with potentially different behaviour, evaluation results, and risk profile. The documentation cadence needs to match this frequency, which means auto-generating a draft from training metadata on each event and routing only the diff to a human reviewer.

An audit log entry for each training event must be immutable (append-only), timestamped, and linked to the training run ID, the corresponding model card version, and the reviewer identity. This chain is the artefact a regulator will inspect: not the card in isolation, but the evidence showing when the model changed, what changed, who reviewed the change, and what the evaluation results were for that version.
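As a minimal sketch of such an entry, the Python below keeps an in-memory append-only log and links each entry to the previous one with a SHA-256 hash. The hash chain is an assumption added here to make tampering detectable; the text above only requires that entries be immutable and timestamped. The names `AuditEntry`, `AuditLog`, and the field names are hypothetical, and a production log would sit on write-once storage rather than a Python list.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable record per training event (hypothetical field names)."""
    training_run_id: str
    model_card_version: str
    reviewer: str
    timestamp: str   # ISO 8601, UTC
    prev_hash: str   # entry_hash of the previous entry, "" for the first
    entry_hash: str  # SHA-256 over the fields above


def _digest(run_id, card_version, reviewer, timestamp, prev_hash):
    # Serialise the fields in a fixed order so the hash is reproducible.
    payload = json.dumps([run_id, card_version, reviewer, timestamp, prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log: the public API offers append and verify, no update or delete."""

    def __init__(self):
        self._entries = []

    def append(self, training_run_id, model_card_version, reviewer, timestamp=None):
        timestamp = timestamp or datetime.now(timezone.utc).isoformat()
        prev_hash = self._entries[-1].entry_hash if self._entries else ""
        entry = AuditEntry(
            training_run_id,
            model_card_version,
            reviewer,
            timestamp,
            prev_hash,
            _digest(training_run_id, model_card_version, reviewer, timestamp, prev_hash),
        )
        self._entries.append(entry)
        return entry

    def entries(self):
        # Return a tuple so callers cannot mutate the log through this view.
        return tuple(self._entries)

    def verify(self):
        """Recompute every hash; False if any entry was altered or reordered."""
        prev_hash = ""
        for e in self._entries:
            expected = _digest(
                e.training_run_id, e.model_card_version, e.reviewer, e.timestamp, prev_hash
            )
            if e.prev_hash != prev_hash or e.entry_hash != expected:
                return False
            prev_hash = e.entry_hash
        return True
```

Calling `log.append("run-001", "card-v1", "alice")` after each training event and `log.verify()` before an inspection gives a cheap integrity check. Because each hash covers the previous one, editing any earlier entry invalidates the check from that entry onward.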

Documentation refresh on every training event

Three documentation conventions

Model cards, datasheets, and system cards: different origins, different audiences, different scopes.

The three conventions come from different communities and were designed for different primary audiences. Model cards (Mitchell et al., 2019; later standardised by HuggingFace) document a model's intended use, limitations, and evaluation results by demographic subgroup; their primary audience is practitioners deciding whether to adopt the model. Datasheets for datasets (Gebru et al., 2018) document the training dataset independently of any model, with the provenance and preprocessing detail that regulated industries need for audits. System cards (Anthropic and OpenAI practice) document the full AI-powered system, including prompts, safety mitigations, and red-team findings; their primary audience is the people assessing the system in its deployment context.

Knowing which one a regulator or procurement officer is asking for matters. "Can you provide documentation about your model?" could mean any of the three. Clarify which artefact is required before writing.

---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - text-classification
  - intent-detection
datasets:
  - internal/support-tickets-v4
metrics:
  - accuracy
  - f1
model-index:
  - name: support-intent-classifier-v2
    results:
      - task:
          type: text-classification
        dataset:
          name: support-tickets-v4
          type: internal/support-tickets-v4
        metrics:
          - type: accuracy
            value: 0.91
          - type: f1
            value: 0.89
---

HuggingFace model card YAML frontmatter — structured fields parseable by the HF Hub index

Convention	Origin	Best fit	What it covers
Model cards	Mitchell et al. 2019; standardised by HuggingFace	Per-model documentation accompanying any model release	Model details, intended use, limitations, training data summary, evaluation metrics by demographic subgroup, ethical considerations
Datasheets for datasets	Gebru et al. 2018	Training dataset documentation, independent of any specific model	Motivation, composition, collection process, preprocessing, intended uses, distribution, maintenance. Especially relevant where training data provenance is subject to audit.
System cards	Anthropic and OpenAI practice; no formal standard	Documenting the full AI-powered system, not just the underlying model	Capabilities, limitations, refusal behaviours, red-team findings, deployment context, pre/post-processing steps. Anthropic's Claude system cards and OpenAI's GPT system cards are the public references.
HuggingFace model card spec	HuggingFace Hub standardisation of model-card practice	Any model published to the HuggingFace Hub	YAML frontmatter with structured fields for licence, language, library, datasets, and metrics. Machine-parseable by the HF Hub index.

Documentation conventions for AI compliance, May 2026

Refresh cadence

Per-training-event regen with diff-only human review — anything slower goes stale at modern retraining frequencies.

Per-training-event regen is the right cadence for model cards and system cards in any system that retrains regularly. Anything slower — quarterly, or "when something significant changes" — produces documentation that a regulator will correctly identify as not reflecting the current model.

The practical implementation: auto-generate a draft card from training metadata (model version, dataset version, hyperparameters, evaluation results) using a templated script. Route the diff — not the full card — to a human reviewer. The reviewer's job is to assess whether the change in behaviour implied by the eval results warrants updated risk statements, not to re-read the entire document from scratch.
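The two steps above can be sketched with the Python standard library alone: `string.Template` renders a draft card from training metadata, and `difflib.unified_diff` produces the diff that goes to the reviewer. The template fields, the `render_card` and `card_diff` names, and the metadata keys are illustrative assumptions, not a prescribed card layout.

```python
import difflib
from string import Template

# Hypothetical plain-text card template; a real card would carry more sections.
CARD_TEMPLATE = Template(
    "Model version: $model_version\n"
    "Dataset version: $dataset_version\n"
    "Hyperparameters: $hyperparameters\n"
    "Accuracy: $accuracy\n"
    "F1: $f1\n"
)


def render_card(metadata):
    """Render a draft card; raises KeyError if a required field is missing."""
    return CARD_TEMPLATE.substitute(metadata)


def card_diff(previous_card, new_card):
    """Return only the changed lines between two card versions.

    n=0 drops unchanged context lines so the reviewer sees the delta alone.
    """
    return list(
        difflib.unified_diff(
            previous_card.splitlines(),
            new_card.splitlines(),
            fromfile="card@previous",
            tofile="card@new",
            lineterm="",
            n=0,
        )
    )
```

Rendering the card for the previous and the new training run and passing both strings to `card_diff` yields a list of `-`/`+` lines, and an empty list when nothing changed. Using `substitute` instead of `safe_substitute` is a deliberate choice in this sketch: a training run that fails to report a field stops the draft from being generated, so a placeholder never ships in a card.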

// PRODUCTION

Per-training-event regen with diff-only human review is the only refresh cadence that stays current at modern retraining frequencies. A model that retrains weekly or nightly changes its documentation more often than anyone can re-review a full card on every run. The human reviews the diff, not the full card.

Audit trail composition

What a regulator will inspect: the chain of evidence linking training events to evaluation results and reviewer sign-offs.

An audit trail that satisfies regulatory inspection needs four components for each model version: the training run metadata (dataset version, hyperparameters, training run ID, timestamp), the evaluation results for that run (linked to the run by ID — not a separate document that could be substituted), the reviewer sign-off (timestamped, with reviewer identity), and the deployment record (when this version entered production and, once replaced, when it was retired).
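One way to make the four components checkable is to model them as records that all carry the training run ID, then report any component that is missing or linked to a different run. The sketch below is in Python; every class, field, and function name is hypothetical, and it shows the linkage idea only, not a regulatory schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingRun:
    run_id: str
    dataset_version: str
    hyperparameters: dict
    timestamp: str


@dataclass(frozen=True)
class EvaluationResult:
    run_id: str  # links the results to the run, so they cannot be swapped
    metrics: dict


@dataclass(frozen=True)
class ReviewerSignOff:
    run_id: str
    reviewer: str
    timestamp: str


@dataclass(frozen=True)
class DeploymentRecord:
    run_id: str
    deployed_at: str
    retired_at: Optional[str] = None  # set once the version is replaced


@dataclass(frozen=True)
class ModelVersionTrail:
    training: Optional[TrainingRun] = None
    evaluation: Optional[EvaluationResult] = None
    sign_off: Optional[ReviewerSignOff] = None
    deployment: Optional[DeploymentRecord] = None


def trail_gaps(trail):
    """List what is missing or mislinked; an empty list means the chain is complete."""
    if trail.training is None:
        return ["missing training run metadata"]
    run_id = trail.training.run_id
    gaps = []
    for label, part in (
        ("evaluation results", trail.evaluation),
        ("reviewer sign-off", trail.sign_off),
        ("deployment record", trail.deployment),
    ):
        if part is None:
            gaps.append(f"missing {label}")
        elif part.run_id != run_id:
            gaps.append(f"{label} linked to {part.run_id}, expected {run_id}")
    return gaps
```

Running `trail_gaps` over every model version before an inspection surfaces broken links early, for example evaluation results that were attached to the wrong run.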

The traceability question — mapping which requirements are covered by which tests, and which test results are associated with which model version — is covered in detail on the traceability sub-page. PII handling in audit logs is a separate concern: audit logs can contain sensitive operational data, and the same controls that apply to production data apply here.
