What Fine-Tuning Locally Actually Costs in Hardware
Fine-tuning a model is not the same workload as running inference. A full fine-tune needs several times the inference-memory footprint, because gradients and optimiser state sit alongside the weights; LoRA drops back close to the inference footprint by training only small adapters against a frozen base; QLoRA drops below it by also quantising that frozen base to 4-bit. Whether your 128GB-unified-memory Mac Studio or single RTX 5090 can fine-tune a given model depends on which of these regimes you pick. This page is a hardware-requirements reference, organised by model size and fine-tuning method.
Key Findings
- QLoRA on a 70B model is feasible on a single 128GB unified-memory device (DGX Spark, Mac Studio M5 Max), making frontier-class fine-tuning accessible at workstation scale for the first time in 2026.
- Full fine-tune of a 70B model still requires multi-GPU clusters or DGX Spark with extended training time; single-workstation full FT is impractical above 13B parameters.
- LoRA on a 7B model fits comfortably in 24GB VRAM (RTX 3090, 4090) at common batch sizes, making it the practical entry point for hobbyist fine-tuning.
- Disk I/O matters: dataset preprocessing and checkpoint writing add 20-40 percent to wall-clock time on slower NVMe drives; budget 4TB+ for serious fine-tuning workflows.
- Mac Studio fine-tuning works via PyTorch MPS or MLX but runs roughly 4x slower than comparable NVIDIA hardware, largely because the CUDA software stack is far more mature.
Memory Requirements by Method (rule-of-thumb multipliers vs FP16 model size)
| Method | Memory multiplier (vs FP16 base model size) | Trainable params |
|---|---|---|
| Full fine-tune (FP16, AdamW) | ~6x | 100% |
| Full fine-tune (BF16 + 8-bit Adam) | ~4x | 100% |
| LoRA (r=16, FP16 base) | ~1.2x | 0.1-1% |
| QLoRA (4-bit base, LoRA r=16) | ~0.4x | 0.1-1% |
| DoRA / RSLoRA (4-bit base) | ~0.5x | 0.5-2% |
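As a quick way to apply these multipliers, the sketch below turns a parameter count into a rough memory estimate. It is a planning aid only: the multipliers are the table's rule-of-thumb figures, and real usage also depends on batch size, sequence length, and activation checkpointing. The ~6x full-fine-tune figure corresponds to roughly 12 bytes per parameter before activations: 2 for FP16 weights, 2 for gradients, and 8 for the two FP32 Adam moments.

```python
# Rough training-memory estimator from the rule-of-thumb multipliers above.
# Estimates exclude activations, KV caches, and framework overhead.

MULTIPLIERS = {
    "full_ft_fp16_adamw": 6.0,     # ~12 bytes/param: FP16 weights + grads + FP32 Adam moments
    "full_ft_bf16_adam8bit": 4.0,
    "lora_r16_fp16_base": 1.2,
    "qlora_r16_4bit_base": 0.4,
    "dora_rslora_4bit_base": 0.5,
}

def fp16_size_gb(params_billion: float) -> float:
    """FP16 weights take ~2 bytes per parameter, so a 7B model is ~14 GB."""
    return params_billion * 2.0

def training_memory_gb(params_billion: float, method: str) -> float:
    return fp16_size_gb(params_billion) * MULTIPLIERS[method]

if __name__ == "__main__":
    # e.g. a 70B model under QLoRA: ~140 GB * 0.4 = ~56 GB, which is why the
    # single 128GB unified-memory devices appear in the fit matrix below.
    print(f"{training_memory_gb(70, 'qlora_r16_4bit_base'):.0f} GB")
```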
Practical Hardware Fit Matrix
| Model | QLoRA | LoRA | Full FT |
|---|---|---|---|
| 7B | RTX 3090 24GB | RTX 4090 24GB | 2x 24GB or single 48GB |
| 13B | RTX 4080 16GB (tight) | RTX 4090 24GB | 2x A100 40GB or single H100 80GB |
| 30B | RTX 4090 24GB | 2x 24GB or 48GB | 4x A100 80GB or DGX cluster |
| 70B | DGX Spark 128GB / Mac M5 Max 128GB | DGX Spark / 2x H100 | 8x H100 80GB cluster |
| 120B (gpt-oss) | DGX Spark / Mac Studio M5 Ultra 192GB | DGX Spark cluster | 16x H100 cluster |
Wall-Clock Time per Epoch (50K-sample dataset, sequence length 2048)
| Workload | DGX Spark | Mac Studio M5 Max | RTX 5090 |
|---|---|---|---|
| QLoRA 7B | ~30 min | ~2 hours | ~45 min |
| LoRA 13B | ~1.5 hours | ~6 hours | ~2 hours (with offload) |
| QLoRA 32B | ~6 hours | ~26 hours | ~9 hours (with offload) |
| QLoRA 70B | ~12 hours | ~52 hours | not feasible |
| QLoRA 120B | ~24 hours | ~96 hours (M5 Ultra only) | not feasible |
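To scale these epoch times to a different dataset, it helps to convert them into an implied token throughput. The snippet below is only arithmetic on the table's own figures (50K samples at sequence length 2048, so roughly 102M tokens per epoch), not an independent benchmark.

```python
# Implied throughput from the wall-clock table above (arithmetic only).
TOKENS_PER_EPOCH = 50_000 * 2048  # ~102.4M tokens per epoch

def implied_tokens_per_second(epoch_hours: float) -> float:
    return TOKENS_PER_EPOCH / (epoch_hours * 3600)

# QLoRA 7B on DGX Spark (~0.5 h/epoch)  -> ~57k tokens/s implied
# QLoRA 70B on DGX Spark (~12 h/epoch)  -> ~2.4k tokens/s implied
print(f"{implied_tokens_per_second(12):,.0f} tokens/s")
```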
Disk and I/O Requirements
Fine-tuning is checkpoint-heavy. A 70B QLoRA run with 5 checkpoints writes roughly 60GB per checkpoint: the LoRA adapters themselves are tiny (see the size sketch after the list below), but the base-model copies kept for resumability are large. Budget:
- Base models cache: 200-500GB (multiple variants and quantizations)
- Datasets and preprocessing artefacts: 50-200GB
- Checkpoints across multiple runs: 200-1000GB
- Total recommended NVMe: 4TB minimum for serious workflows
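The claim that adapters are a small fraction of checkpoint size holds up to a quick calculation. The sketch below estimates LoRA adapter size from the standard LoRA shapes (each adapted weight gains an A of shape d_in x r and a B of shape r x d_out). The layer count, hidden size, and attention-only targets are illustrative 70B-class values, and every projection is treated as square, which overestimates the k/v projections under grouped-query attention.

```python
def lora_adapter_size_mb(num_layers: int, hidden: int, rank: int,
                         matrices_per_layer: int, bytes_per_param: int = 2) -> float:
    """Each adapted (hidden x hidden) weight adds A (hidden x r) and B (r x hidden)."""
    params_per_matrix = 2 * hidden * rank
    total_params = num_layers * matrices_per_layer * params_per_matrix
    return total_params * bytes_per_param / 1e6

# Illustrative 70B-class shape: 80 layers, hidden size 8192, r=16, adapting the
# four attention projections -> roughly 170 MB of adapters, versus the
# ~60GB-per-checkpoint figure above, which is dominated by the base-model copy.
print(f"{lora_adapter_size_mb(80, 8192, 16, 4):.0f} MB")
```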
Software Stack Notes
For NVIDIA hardware, Hugging Face Transformers + PEFT + bitsandbytes is the canonical QLoRA stack; Unsloth is a popular performance-optimised wrapper offering a 1.5-2x training speedup on consumer GPUs. For Apple Silicon, the MLX framework and its LoRA fine-tuning examples are the primary stack; PyTorch MPS works but is slower.
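As a minimal sketch of what that NVIDIA-side stack looks like in practice (the model ID and LoRA hyperparameters below are illustrative placeholders, not recommendations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

# Quantise the frozen base weights to 4-bit (NF4, as in the QLoRA paper),
# computing in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare the quantised model for training, then attach small LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

From here the model trains like any other Transformers model, e.g. with the Trainer or TRL's SFTTrainer; Unsloth wraps an equivalent setup behind its own loading API.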
Brand Visibility Implications
Fine-tuned brand-aware models are a quietly important AI-visibility surface. Enterprises increasingly fine-tune open-weight base models (Llama 4, Qwen 3, Mistral) on internal documentation, product information, and customer-support transcripts. The resulting models shape how employees use AI internally and, increasingly, how customer-facing chatbots behave. Where these fine-tunes happen on local DGX Spark or workstation clusters, the resulting brand-recommendation behaviour is invisible to cloud-API monitoring. As QLoRA on 70B models becomes feasible on a single workstation, this surface is growing fast.
Methodology
Memory multipliers are rule-of-thumb estimates from the Hugging Face PEFT documentation and the original QLoRA paper (Dettmers et al., 2023). Wall-clock figures are aggregated from public training runs reported in the Unsloth GitHub and MLX Examples discussions. Real runs vary with sequence length, batch size, optimiser, and gradient-checkpointing settings; treat these figures as guidance, not a guarantee. Updated quarterly.
How Presenc AI Helps
Presenc AI tracks brand visibility on enterprise fine-tuned LLM deployments through deployment-side instrumentation, the only visibility available for fine-tuned models that never hit a cloud API. For enterprises shipping internal copilots or customer-facing chatbots on fine-tuned open-weight models, this is the operational answer to "what brands does our model recommend?"