Why You Need a Training Data Audit
Training data is the single biggest liability surface in AI. Most AI-related lawsuits, regulatory actions, and PR incidents trace back to undocumented or under-vetted training data. A formal audit documents data lineage, surfaces licensing risk early, and produces the paper trail that enterprise customers and regulators increasingly require. This template is designed for any team training or fine-tuning AI models on proprietary or third-party data.
Section 1: Dataset Inventory
For every dataset used in training or fine-tuning, capture:
| Field | Why it matters |
|---|---|
| Dataset name + version | Auditability |
| Source (URL, vendor, internal system) | Provenance |
| Acquisition method (download, API, purchase, internal collection) | Licensing trail |
| Acquisition date | Snapshot timing |
| License (with link to license text) | Use rights |
| Volume (rows, tokens, bytes) | Scale |
| Modality (text, image, audio, video) | Risk surface |
| Contains PII (yes / no / unknown) | Privacy |
| Contains copyrighted material (yes / no / unknown) | IP |
| Geographic scope | Compliance jurisdictions |
| Last reviewed | Freshness of audit |
| Owner (named person) | Accountability |
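The inventory fields above can be captured as a structured record so audits are queryable rather than buried in spreadsheets. A minimal sketch; the class and field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetRecord:
    """One row of the dataset inventory. Field names are illustrative."""
    name: str
    version: str
    source: str                  # URL, vendor, or internal system
    acquisition_method: str      # download / API / purchase / internal collection
    acquisition_date: date
    license_name: str
    license_url: str
    volume_tokens: int
    modality: str                # text / image / audio / video
    contains_pii: str            # "yes" / "no" / "unknown"
    contains_copyrighted: str    # "yes" / "no" / "unknown"
    geographic_scope: str
    last_reviewed: date
    owner: str                   # named person, not a team alias

    def is_stale(self, today: date, max_age_days: int = 365) -> bool:
        """Flag records whose last review is older than the audit window."""
        return (today - self.last_reviewed).days > max_age_days
```

Storing records this way makes the "last reviewed" red flag (Section: Red Flags) a one-line query instead of a manual pass.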
Section 2: Provenance Verification
- Confirm the files you actually hold match the listed source: compare file hashes against those recorded at acquisition, or spot-check samples.
- Cross-check upstream license text against current vendor terms (license can change).
- Verify any "open" license restrictions: attribution, non-commercial clauses, share-alike requirements.
- Document chain-of-custody for any dataset that passed through an intermediary.
- Flag datasets with unclear or conflicting provenance for legal review.
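The hash check in the first step above can be sketched in a few lines. This assumes you recorded a SHA-256 hash at acquisition time; streaming in chunks avoids loading large datasets into memory:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_provenance(path: str, expected_sha256: str) -> bool:
    """Compare the on-disk hash to the hash recorded at acquisition time."""
    return file_sha256(path) == expected_sha256
```

A mismatch does not prove tampering, but it does mean the file you are training on is not the file you documented, which itself is a finding.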
Section 3: PII and Privacy Screening
- Run automated PII detection across each dataset (names, emails, phone, SSNs, IDs).
- If PII is found with no legal basis for processing it, remove or remediate the dataset.
- If PII is permissible (consented, properly anonymised), document the legal basis.
- For PII subject to GDPR / CCPA: document how data subject rights (access, correction, deletion) will be honoured.
- Document retention and deletion procedures.
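A first-pass automated scan can be as simple as regex matching, sketched below. These patterns are illustrative only: a real audit should use a vetted PII detection tool, since regexes of this kind miss many formats and produce false positives:

```python
import re

# Illustrative patterns only; a production audit needs a vetted PII detector.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category; an empty dict means nothing flagged."""
    hits: dict[str, list[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits
```

Treat a non-empty result as "contains PII: yes" in the inventory and route the dataset to the legal-basis check above.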
Section 4: Licensing Review
| License type | Action |
|---|---|
| CC0 / Public Domain | Generally safe for any use |
| CC-BY | Use permitted with attribution in model card |
| CC-BY-SA | Use permitted but downstream model may need to inherit share-alike |
| CC-BY-NC | Non-commercial only; not suitable for commercial models |
| Custom permissive (e.g., MIT, Apache) | Check applicability to data (often these are code licenses misapplied) |
| Commercial vendor license | Use per contract; check scope, duration, downstream rights |
| Web-scraped / no clear license | Flag for legal; document robots.txt compliance and terms-of-service review |
| User-generated content (consented) | Confirm consent scope covers training |
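The table above can be encoded as a policy lookup so ingestion pipelines fail closed. The mapping below is a hypothetical sketch; the action names and the set of recognised licenses should come from your legal team's actual policy:

```python
# Hypothetical license-to-action policy; extend to match your legal policy.
LICENSE_ACTIONS = {
    "CC0": "allow",
    "CC-BY": "allow_with_attribution",
    "CC-BY-SA": "legal_review",       # share-alike may propagate downstream
    "CC-BY-NC": "block_commercial",
    "unknown": "legal_review",
}

def license_action(license_name: str) -> str:
    """Unrecognised licenses default to legal review, never silent approval."""
    return LICENSE_ACTIONS.get(license_name, "legal_review")
```

The key design choice is the default: any license the pipeline has not seen before is escalated rather than allowed.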
Section 5: Representativeness and Bias Review
- Document demographic and geographic distribution where relevant.
- Compare distribution to target deployment population.
- Flag known underrepresentation or overrepresentation.
- Document remediation steps (oversampling, weighting, exclusion).
- Plan recurring bias review post-deployment.
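Comparing the training distribution to the target deployment population (the second step above) can be sketched as a simple share-by-share diff. The 5% tolerance below is an arbitrary placeholder, not a recommended threshold:

```python
def representation_gaps(
    train_dist: dict[str, float],
    target_dist: dict[str, float],
    tolerance: float = 0.05,  # placeholder threshold; tune per use case
) -> dict[str, float]:
    """Return groups whose training-data share deviates from the target
    deployment share by more than `tolerance` (absolute difference).
    Groups absent from the training data count as share 0.0."""
    gaps: dict[str, float] = {}
    for group, target_share in target_dist.items():
        diff = train_dist.get(group, 0.0) - target_share
        if abs(diff) > tolerance:
            gaps[group] = diff  # negative = underrepresented
    return gaps
```

The signed output feeds directly into the remediation step: negative gaps suggest oversampling or weighting, large positive gaps may warrant downsampling.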
Section 6: Contamination Check
- Verify training data does not include benchmark or evaluation sets the model will be tested on.
- Run deduplication against major public benchmarks (MMLU, HumanEval, GSM8K, SWE-bench).
- For domain models: deduplicate against the company's own eval set.
- Document any near-duplicate matches and remediation.
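One common contamination heuristic is word-level n-gram overlap between training documents and eval items, sketched below. The choice of n=8 is a convention seen in practice, not a standard, and real pipelines hash n-grams at corpus scale rather than comparing document pairs:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams after lowercasing; crude but order-sensitive."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(train_doc: str, eval_doc: str, n: int = 8) -> float:
    """Fraction of the eval doc's n-grams that also appear in the train doc.
    1.0 = fully contained; 0.0 = no n-gram overlap (or eval doc too short)."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)
```

Any score above zero is worth logging; scores near 1.0 indicate the eval item is effectively in the training set and the affected results should be discarded.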
Section 7: Documentation Requirements
Maintain a model card per trained or fine-tuned model that includes:
- Datasets used (with the inventory fields above).
- Training procedure summary.
- Known limitations and bias caveats.
- Intended use and out-of-scope use.
- Evaluation results.
- Contact for questions.
Section 8: Recurring Audit Cadence
- Full audit before model launch.
- Lightweight refresh quarterly for production models.
- Full re-audit on every retraining or significant fine-tune.
- Spot-audit when new vendor or upstream license terms change.
- External audit annually for high-risk or regulated models.
Red Flags
- "Unknown" provenance on more than 5% of training tokens.
- Web scrapes without documented robots.txt and ToS compliance.
- PII present with no documented legal basis.
- Benchmark contamination not deduplicated.
- Licensing review last completed more than a year ago.
- No named owner for a major dataset.
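The first red flag above is mechanical enough to automate against the dataset inventory. A minimal sketch, assuming each dataset's token count and a boolean "provenance known" flag are available:

```python
def unknown_provenance_flag(
    records: list[tuple[int, bool]],  # (token_count, provenance_known)
    threshold: float = 0.05,          # the 5% limit from the red-flag list
) -> bool:
    """True if datasets with unknown provenance exceed `threshold`
    of total training tokens."""
    total = sum(tokens for tokens, _ in records)
    if total == 0:
        return False
    unknown = sum(tokens for tokens, known in records if not known)
    return unknown / total > threshold
```

Run alongside the staleness and ownership checks, this turns most of the red-flag list into a pre-launch gate rather than a manual review.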