Audit trails, model cards, and datasheets

9 min read · Reviewed May 2026

AI compliance documentation is a refresh problem more than a writing problem. Writing a model card once is easy; keeping it current as the model retrains weekly is the actual challenge. The three main conventions — model cards, datasheets for datasets, and system cards — overlap meaningfully but were each designed for a different audience. Knowing which to produce when saves time you would otherwise spend producing all three.

Difficulty: intermediate
You'll learn: three documentation conventions (model cards, datasheets for datasets, and system cards), what each one is for, and the refresh discipline that keeps them from going stale.

The retraining loop

Documentation refresh on every training event — diff-only human review is the only cadence that stays current.

Modern AI systems retrain frequently — weekly or nightly in production contexts. Each training event creates a new model artefact with potentially different behaviour, evaluation results, and risk profile. The documentation cadence needs to match this frequency, which means auto-generating a draft from training metadata on each event and routing only the diff to a human reviewer.

An audit log entry for each training event must be immutable (append-only), timestamped, and linked to the training run ID, the corresponding model card version, and the reviewer identity. This chain is the artefact a regulator will inspect: not the card in isolation, but the evidence showing when the model changed, what changed, who reviewed the change, and what the evaluation results were for that version.
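One way to make the append-only property checkable is to chain entries by hash, so that editing an earlier entry invalidates every later one. The sketch below is a minimal in-memory illustration, not a prescribed design: the hash chain itself is an assumption added here, and `AuditEntry`, `AuditLog`, and all field names are hypothetical. A real deployment would also need write-once storage underneath.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    training_run_id: str
    model_card_version: str
    reviewer: str
    timestamp: str   # ISO 8601, UTC
    prev_hash: str   # hash of the previous entry, "" for the first entry
    entry_hash: str  # SHA-256 over all the fields above


def _digest(fields: dict) -> str:
    """Hash the entry fields in a stable (sorted-key) JSON encoding."""
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AuditLog:
    """Hypothetical append-only log: entries are added, never edited."""

    def __init__(self) -> None:
        self._entries: list[AuditEntry] = []

    def append(self, training_run_id: str, model_card_version: str,
               reviewer: str) -> AuditEntry:
        fields = {
            "training_run_id": training_run_id,
            "model_card_version": model_card_version,
            "reviewer": reviewer,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": self._entries[-1].entry_hash if self._entries else "",
        }
        entry = AuditEntry(**fields, entry_hash=_digest(fields))
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check each link to the previous entry."""
        prev = ""
        for entry in self._entries:
            fields = asdict(entry)
            stored = fields.pop("entry_hash")
            if entry.prev_hash != prev or _digest(fields) != stored:
                return False
            prev = stored
        return True
```

Calling `verify()` before handing the log to an inspector gives a cheap integrity check: any retroactive change to a run ID, card version, reviewer, or timestamp makes it return `False`.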

Flow diagram: Training event → Auto-regen card draft → Human review → Audit log entry (immutable record)
Documentation refresh on every training event

Three documentation conventions

Model cards, datasheets, and system cards: different origins, different audiences, different scopes.

The three conventions come from different communities and were designed for different primary audiences. Model cards (Mitchell et al., 2019; later standardised by HuggingFace) document a model's intended use, limitations, and evaluation results by demographic subgroup; their primary audience is practitioners deciding whether to adopt the model. Datasheets for datasets (Gebru et al., 2018) document the training dataset independently of any model, with the provenance and preprocessing detail that regulated industries need for audits. System cards (Anthropic and OpenAI practice) document the full AI-powered system, including prompts, safety mitigations, and red-team findings; their audience is whoever has to assess the system in its deployment context.

Knowing which one a regulator or procurement officer is asking for matters. "Can you provide documentation about your model?" could mean any of the three. Clarify which artefact is required before writing.

---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - text-classification
  - intent-detection
datasets:
  - internal/support-tickets-v4
metrics:
  - accuracy
  - f1
model-index:
  - name: support-intent-classifier-v2
    results:
      - task:
          type: text-classification
        dataset:
          name: support-tickets-v4
          type: internal/support-tickets-v4
        metrics:
          - type: accuracy
            value: 0.91
          - type: f1
            value: 0.89
---
HuggingFace model card YAML frontmatter — structured fields parseable by the HF Hub index
| Convention | Origin | Best fit | What it covers |
| --- | --- | --- | --- |
| Model cards | Mitchell et al. 2019; standardised by HuggingFace | Per-model documentation accompanying any model release | Model details, intended use, limitations, training data summary, evaluation metrics by demographic subgroup, ethical considerations |
| Datasheets for datasets | Gebru et al. 2018 | Training dataset documentation, independent of any specific model | Motivation, composition, collection process, preprocessing, intended uses, distribution, maintenance. Especially relevant where training data provenance is subject to audit. |
| System cards | Anthropic and OpenAI practice; no formal standard | Documenting the full AI-powered system, not just the underlying model | Capabilities, limitations, refusal behaviours, red-team findings, deployment context, pre/post-processing steps. Anthropic's Claude system cards and OpenAI's GPT system cards are the public references. |
| HuggingFace model card spec | HuggingFace Hub standardisation of model-card practice | Any model published to the HuggingFace Hub | YAML frontmatter with structured fields for licence, language, library, datasets, and metrics. Machine-parseable by the HF Hub index. |

Documentation conventions for AI compliance, May 2026

Refresh cadence

Per-training-event regen with diff-only human review — anything slower goes stale at modern retraining frequencies.

Per-training-event regen is the right cadence for model cards and system cards in any system that retrains regularly. Anything slower — quarterly, or "when something significant changes" — produces documentation that a regulator will correctly identify as not reflecting the current model.

The practical implementation: auto-generate a draft card from training metadata (model version, dataset version, hyperparameters, evaluation results) using a templated script. Route the diff — not the full card — to a human reviewer. The reviewer's job is to assess whether the change in behaviour implied by the eval results warrants updated risk statements, not to re-read the entire document from scratch.
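The templated-script-plus-diff step can be sketched with the Python standard library alone. This is an illustration under assumptions: `CARD_TEMPLATE`, `render_card`, and `card_diff` are hypothetical names, the card layout is invented for the example, and the "new" metric values in the usage section are made up to show a changed line.

```python
import difflib
from string import Template

# Hypothetical card layout: one line per metadata field, so that a change
# in any single field shows up as a one-line diff.
CARD_TEMPLATE = Template(
    "Model version: $model_version\n"
    "Dataset version: $dataset_version\n"
    "Hyperparameters: $hyperparameters\n"
    "Accuracy: $accuracy\n"
    "F1: $f1\n"
)


def render_card(metadata: dict) -> str:
    """Render a draft card; raises KeyError if a template field is missing."""
    return CARD_TEMPLATE.substitute(metadata)


def card_diff(previous: str, current: str) -> str:
    """Return a unified diff of two card drafts ('' when nothing changed)."""
    lines = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="card/previous",
        tofile="card/current",
        lineterm="",
    )
    return "\n".join(lines)


if __name__ == "__main__":
    previous = render_card({
        "model_version": "v2", "dataset_version": "support-tickets-v4",
        "hyperparameters": "lr=2e-5, epochs=3", "accuracy": 0.91, "f1": 0.89,
    })
    # Illustrative values for a later training run.
    current = render_card({
        "model_version": "v3", "dataset_version": "support-tickets-v4",
        "hyperparameters": "lr=2e-5, epochs=3", "accuracy": 0.93, "f1": 0.88,
    })
    print(card_diff(previous, current))  # this is what the reviewer sees
```

Keeping one field per line is the design choice that makes the diff useful: the reviewer sees exactly which metrics moved and can decide whether the risk statements need to change.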

Production note

Per-training-event regen with diff-only human review is the only refresh cadence that stays current at modern retraining frequencies. A model that retrains weekly or nightly produces more card revisions than reviewers can re-read in full on every run, so humans review the diff, not the full card.

Audit trail composition

What a regulator will inspect: the chain of evidence linking training events to evaluation results and reviewer sign-offs.

An audit trail that satisfies regulatory inspection needs four components for each model version: the training run metadata (dataset version, hyperparameters, training run ID, timestamp), the evaluation results for that run (linked to the run by ID — not a separate document that could be substituted), the reviewer sign-off (timestamped, with reviewer identity), and the deployment record (when this version entered production and, once replaced, when it was retired).
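The four components can be modelled as one record per model version, with a check that flags gaps before an inspection does. The sketch below is a hypothetical data model, not a regulatory schema: every class and field name is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingRun:
    run_id: str
    dataset_version: str
    hyperparameters: dict
    timestamp: str


@dataclass(frozen=True)
class EvaluationResult:
    run_id: str  # must match TrainingRun.run_id, so results cannot be swapped
    metrics: dict


@dataclass(frozen=True)
class ReviewerSignOff:
    reviewer: str
    timestamp: str


@dataclass(frozen=True)
class DeploymentRecord:
    deployed_at: str
    retired_at: Optional[str] = None  # set once the version is replaced


@dataclass(frozen=True)
class ModelVersionTrail:
    training_run: Optional[TrainingRun]
    evaluation: Optional[EvaluationResult]
    sign_off: Optional[ReviewerSignOff]
    deployment: Optional[DeploymentRecord]


def trail_gaps(trail: ModelVersionTrail) -> list[str]:
    """List what is missing or unlinked in one model version's audit trail."""
    gaps = []
    if trail.training_run is None:
        gaps.append("training run metadata missing")
    if trail.evaluation is None:
        gaps.append("evaluation results missing")
    elif (trail.training_run is not None
          and trail.evaluation.run_id != trail.training_run.run_id):
        gaps.append("evaluation results not linked to this training run")
    if trail.sign_off is None:
        gaps.append("reviewer sign-off missing")
    if trail.deployment is None:
        gaps.append("deployment record missing")
    return gaps
```

Running `trail_gaps` over every deployed version gives a simple report of which versions would fail the chain-of-evidence test, for example a version whose evaluation results carry a different run ID.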

The traceability question — mapping which requirements are covered by which tests, and which test results are associated with which model version — is covered in detail on the traceability sub-page. PII handling in audit logs is a separate concern: audit logs can contain sensitive operational data, and the same controls that apply to production data apply here.