Audit trails, model cards, and datasheets
AI compliance documentation is a refresh problem more than a writing problem. Writing a model card once is easy; keeping it current as the model retrains weekly is the actual challenge. The three main conventions — model cards, datasheets for datasets, and system cards — overlap meaningfully but were each designed for a different audience. Knowing which to produce when saves time you would otherwise spend producing all three.
The retraining loop
Refreshing documentation on every training event, with human review of the diff only, is the only cadence that stays current.
Modern AI systems retrain frequently — weekly or nightly in production contexts. Each training event creates a new model artefact with potentially different behaviour, evaluation results, and risk profile. The documentation cadence needs to match this frequency, which means auto-generating a draft from training metadata on each event and routing only the diff to a human reviewer.
An audit log entry for each training event must be immutable (append-only), timestamped, and linked to the training run ID, the corresponding model card version, and the reviewer identity. This chain is the artefact a regulator will inspect: not the card in isolation, but the evidence showing when the model changed, what changed, who reviewed the change, and what the evaluation results were for that version.
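The properties listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the names `AuditEntry` and `AuditLog` are hypothetical, and the SHA-256 hash chain is an assumption added here as one common way to make tampering with an append-only log detectable. A production system would more likely rely on write-once storage or a managed ledger.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable record per training event (hypothetical schema)."""
    run_id: str          # training run ID
    card_version: str    # model card version produced for this run
    reviewer: str        # identity of the human reviewer
    timestamp: str       # ISO 8601, UTC
    prev_hash: str       # hash of the previous entry (assumed chaining scheme)
    entry_hash: str      # hash over this entry's fields plus prev_hash


def _digest(run_id, card_version, reviewer, timestamp, prev_hash):
    payload = json.dumps([run_id, card_version, reviewer, timestamp, prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log: entries can be added and read, never edited."""
    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, run_id, card_version, reviewer, timestamp=None):
        timestamp = timestamp or datetime.now(timezone.utc).isoformat()
        prev_hash = self._entries[-1].entry_hash if self._entries else self.GENESIS
        entry = AuditEntry(
            run_id, card_version, reviewer, timestamp, prev_hash,
            _digest(run_id, card_version, reviewer, timestamp, prev_hash),
        )
        self._entries.append(entry)
        return entry

    def entries(self):
        return tuple(self._entries)

    def verify(self):
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = self.GENESIS
        for e in self._entries:
            expected = _digest(e.run_id, e.card_version, e.reviewer, e.timestamp, e.prev_hash)
            if e.prev_hash != prev or e.entry_hash != expected:
                return False
            prev = e.entry_hash
        return True
```

Because each entry's hash covers the previous entry's hash, changing an old record invalidates every record after it, which is what lets `verify()` detect after-the-fact edits.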
Three documentation conventions
Model cards, datasheets, and system cards: different origins, different audiences, different scopes.
The three conventions come from different communities and were designed for different primary audiences. Model cards (Mitchell et al., 2019; later standardised by HuggingFace) document a model's intended use, limitations, and evaluation results by demographic subgroup; their primary audience is practitioners deciding whether to adopt the model. Datasheets for datasets (Gebru et al., 2018) document the training dataset independently of any model, with the provenance and preprocessing detail that regulated industries need for audits. System cards (Anthropic and OpenAI practice) document the full AI-powered system, including prompts, safety mitigations, and red-team findings; their primary audience is anyone assessing the system as deployed rather than the underlying model alone.
Knowing which one a regulator or procurement officer is asking for matters. "Can you provide documentation about your model?" could mean any of the three. Clarify which artefact is required before writing.
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- intent-detection
datasets:
- internal/support-tickets-v4
metrics:
- accuracy
- f1
model-index:
- name: support-intent-classifier-v2
  results:
  - task:
      type: text-classification
    dataset:
      name: support-tickets-v4
      type: internal/support-tickets-v4
    metrics:
    - type: accuracy
      value: 0.91
    - type: f1
      value: 0.89
---

| Convention | Origin | Best fit | What it covers |
|---|---|---|---|
| Model cards | Mitchell et al. 2019; standardised by HuggingFace | Per-model documentation accompanying any model release | Model details, intended use, limitations, training data summary, evaluation metrics by demographic subgroup, ethical considerations |
| Datasheets for datasets | Gebru et al. 2018 | Training dataset documentation, independent of any specific model | Motivation, composition, collection process, preprocessing, intended uses, distribution, maintenance. Especially relevant where training data provenance is subject to audit. |
| System cards | Anthropic and OpenAI practice; no formal standard | Documenting the full AI-powered system, not just the underlying model | Capabilities, limitations, refusal behaviours, red-team findings, deployment context, pre/post-processing steps. Anthropic's Claude system cards and OpenAI's GPT system cards are the public references. |
| HuggingFace model card spec | HuggingFace Hub standardisation of model-card practice | Any model published to the HuggingFace Hub | YAML frontmatter with structured fields for licence, language, library, datasets, and metrics. Machine-parseable by the HF Hub index. |
Documentation conventions for AI compliance, May 2026
Refresh cadence
Per-training-event regen with diff-only human review — anything slower goes stale at modern retraining frequencies.
Per-training-event regen is the right cadence for model cards and system cards in any system that retrains regularly. Anything slower — quarterly, or "when something significant changes" — produces documentation that a regulator will correctly identify as not reflecting the current model.
The practical implementation: auto-generate a draft card from training metadata (model version, dataset version, hyperparameters, evaluation results) using a templated script. Route the diff — not the full card — to a human reviewer. The reviewer's job is to assess whether the change in behaviour implied by the eval results warrants updated risk statements, not to re-read the entire document from scratch.
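The draft-then-diff step described above can be sketched with the Python standard library alone. Everything here is illustrative: the template fields, the function names `render_card` and `card_diff`, and the second run's metadata values are assumptions rather than part of any existing pipeline, and a real card would carry many more sections.

```python
import difflib
from string import Template

# Hypothetical, deliberately tiny card template filled from training metadata.
CARD_TEMPLATE = Template(
    "Model version: $model_version\n"
    "Dataset version: $dataset_version\n"
    "Learning rate: $learning_rate\n"
    "Accuracy: $accuracy\n"
    "F1: $f1\n"
)


def render_card(metadata):
    """Render a draft card; raises KeyError if a metadata field is missing."""
    return CARD_TEMPLATE.substitute(metadata)


def card_diff(previous, current):
    """Return a unified diff of two card drafts, or '' if nothing changed."""
    return "".join(
        difflib.unified_diff(
            previous.splitlines(keepends=True),
            current.splitlines(keepends=True),
            fromfile="card@previous",
            tofile="card@current",
        )
    )


# Example: two consecutive training events (the second set of values is made up).
previous_card = render_card({
    "model_version": "v2", "dataset_version": "support-tickets-v4",
    "learning_rate": 3e-5, "accuracy": 0.91, "f1": 0.89,
})
current_card = render_card({
    "model_version": "v3", "dataset_version": "support-tickets-v4",
    "learning_rate": 3e-5, "accuracy": 0.88, "f1": 0.86,
})
# Only this diff, not the full card, would be routed to the reviewer.
print(card_diff(previous_card, current_card))
```

An empty diff is a useful signal in its own right: if retraining changed nothing the card reports, the event can be logged without asking a reviewer to read anything.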
Audit trail composition
What a regulator will inspect: the chain of evidence linking training events to evaluation results and reviewer sign-offs.
An audit trail that satisfies regulatory inspection needs four components for each model version: the training run metadata (dataset version, hyperparameters, training run ID, timestamp), the evaluation results for that run (linked to the run by ID — not a separate document that could be substituted), the reviewer sign-off (timestamped, with reviewer identity), and the deployment record (when this version entered production and, once replaced, when it was retired).
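One way to make the four components concrete is to model each as a record keyed by the training run ID and to check that they all point at the same run. The sketch below is a hypothetical schema, not a regulatory format: the class names, field names, and the `audit_gaps` helper are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingRun:
    run_id: str
    dataset_version: str
    hyperparameters: dict
    timestamp: str


@dataclass(frozen=True)
class EvaluationResult:
    run_id: str              # links the results to the run by ID
    metrics: dict


@dataclass(frozen=True)
class ReviewerSignOff:
    run_id: str
    reviewer: str
    timestamp: str


@dataclass(frozen=True)
class DeploymentRecord:
    run_id: str
    deployed_at: str
    retired_at: Optional[str] = None   # set once the version is replaced


def audit_gaps(run, evaluation, sign_off, deployment):
    """List what is missing or mislinked for one model version."""
    gaps = []
    components = (
        ("evaluation", evaluation),
        ("sign_off", sign_off),
        ("deployment", deployment),
    )
    for name, component in components:
        if component is None:
            gaps.append(f"missing {name}")
        elif component.run_id != run.run_id:
            gaps.append(f"{name} linked to {component.run_id}, expected {run.run_id}")
    return gaps
```

Running a check like `audit_gaps` before a version is promoted to production turns "the evaluation results could have been substituted" from an inspection finding into a failed pre-deployment check.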
The traceability question — mapping which requirements are covered by which tests, and which test results are associated with which model version — is covered in detail on the traceability sub-page. PII handling in audit logs is a separate concern: audit logs can contain sensitive operational data, and the same controls that apply to production data apply here.