What Fine-Tuning Locally Actually Costs in Hardware
Fine-tuning a model is not the same workload as running inference. A full fine-tune needs several times the inference-memory footprint, because gradients and optimiser state sit alongside the weights; LoRA drops back close to the inference footprint by training only small adapters against a frozen base; QLoRA drops below it by also quantising that frozen base to 4-bit. Whether your 128GB-unified-memory Mac Studio or single RTX 5090 can fine-tune a given model depends on which of these regimes you pick. This page is a hardware-requirements reference, organised by model size and fine-tuning method.
Key Findings
- QLoRA on a 70B model is feasible on a single 128GB unified-memory device (DGX Spark, Mac Studio M5 Max), making frontier-class fine-tuning accessible at workstation scale for the first time in 2026.
- Full fine-tune of a 70B model still requires multi-GPU clusters or DGX Spark with extended training time; single-workstation full FT is impractical above 13B parameters.
- LoRA on a 7B model fits comfortably in 24GB VRAM (RTX 3090, 4090) at common batch sizes, making it the practical entry point for hobbyist fine-tuning.
- Disk I/O matters: dataset preprocessing and checkpoint writing add 20-40 percent to wall-clock time on slower NVMe drives; budget 4TB+ for serious fine-tuning workflows.
- Mac Studio fine-tuning works via PyTorch MPS or MLX but runs roughly 4x slower than comparable NVIDIA hardware, largely because the CUDA software stack is far more mature.
Memory Requirements by Method (rule-of-thumb multipliers vs FP16 model size)
| Method | Memory multiplier (vs FP16 base model size) | Trainable params |
|---|---|---|
| Full fine-tune (FP16, AdamW) | ~6x | 100% |
| Full fine-tune (BF16 + 8-bit Adam) | ~4x | 100% |
| LoRA (r=16, FP16 base) | ~1.2x | 0.1-1% |
| QLoRA (4-bit base, LoRA r=16) | ~0.4x | 0.1-1% |
| DoRA / RSLoRA (4-bit base) | ~0.5x | 0.5-2% |
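As a quick way to apply these multipliers, the sketch below turns a parameter count into a rough memory estimate. It is a planning aid only: the multipliers are the table's rule-of-thumb figures, and real usage also depends on batch size, sequence length, and activation checkpointing. The ~6x full-fine-tune figure corresponds to roughly 12 bytes per parameter before activations: 2 for FP16 weights, 2 for gradients, and 8 for the two FP32 Adam moments.

```python
# Rough training-memory estimator from the rule-of-thumb multipliers above.
# Estimates exclude activations, KV caches, and framework overhead.

MULTIPLIERS = {
    "full_ft_fp16_adamw": 6.0,     # ~12 bytes/param: FP16 weights + grads + FP32 Adam moments
    "full_ft_bf16_adam8bit": 4.0,
    "lora_r16_fp16_base": 1.2,
    "qlora_r16_4bit_base": 0.4,
    "dora_rslora_4bit_base": 0.5,
}

def fp16_size_gb(params_billion: float) -> float:
    """FP16 weights take ~2 bytes per parameter, so a 7B model is ~14 GB."""
    return params_billion * 2.0

def training_memory_gb(params_billion: float, method: str) -> float:
    return fp16_size_gb(params_billion) * MULTIPLIERS[method]

if __name__ == "__main__":
    # e.g. a 70B model under QLoRA: ~140 GB * 0.4 = ~56 GB, which is why the
    # single 128GB unified-memory devices appear in the fit matrix below.
    print(f"{training_memory_gb(70, 'qlora_r16_4bit_base'):.0f} GB")
```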
Practical Hardware Fit Matrix
| Model | QLoRA | LoRA | Full FT |
|---|---|---|---|
| 7B | RTX 3090 24GB | RTX 4090 24GB | 2x 24GB or single 48GB |
| 13B | RTX 4080 16GB (tight) | RTX 4090 24GB | 2x A100 40GB or single H100 80GB |
| 30B | RTX 4090 24GB | 2x 24GB or 48GB | 4x A100 80GB or DGX cluster |
| 70B | DGX Spark 128GB / Mac M5 Max 128GB | DGX Spark / 2x H100 | 8x H100 80GB cluster |
| 120B (gpt-oss) | DGX Spark / Mac Studio M5 Ultra 192GB | DGX Spark cluster | 16x H100 cluster |
Wall-Clock Time per Epoch (50K-sample dataset, sequence length 2048)
| Workload | DGX Spark | Mac Studio M5 Max | RTX 5090 |
|---|---|---|---|
| QLoRA 7B | ~30 min | ~2 hours | ~45 min |
| LoRA 13B | ~1.5 hours | ~6 hours | ~2 hours (with offload) |
| QLoRA 32B | ~6 hours | ~26 hours | ~9 hours (with offload) |
| QLoRA 70B | ~12 hours | ~52 hours | not feasible |
| QLoRA 120B | ~24 hours | ~96 hours (M5 Ultra only) | not feasible |
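To scale these epoch times to a different dataset, it helps to convert them into an implied token throughput. The snippet below is only arithmetic on the table's own figures (50K samples at sequence length 2048, so roughly 102M tokens per epoch), not an independent benchmark.

```python
# Implied throughput from the wall-clock table above (arithmetic only).
TOKENS_PER_EPOCH = 50_000 * 2048  # ~102.4M tokens per epoch

def implied_tokens_per_second(epoch_hours: float) -> float:
    return TOKENS_PER_EPOCH / (epoch_hours * 3600)

# QLoRA 7B on DGX Spark (~0.5 h/epoch)  -> ~57k tokens/s implied
# QLoRA 70B on DGX Spark (~12 h/epoch)  -> ~2.4k tokens/s implied
print(f"{implied_tokens_per_second(12):,.0f} tokens/s")
```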
Disk and I/O Requirements
Fine-tuning is checkpoint-heavy. A 70B QLoRA run with 5 checkpoints writes roughly 60GB per checkpoint: the LoRA adapters themselves are tiny (see the size sketch after the list below), but the base-model copies kept for resumability are large. Budget:
- Base models cache: 200-500GB (multiple variants and quantizations)
- Datasets and preprocessing artefacts: 50-200GB
- Checkpoints across multiple runs: 200-1000GB
- Total recommended NVMe: 4TB minimum for serious workflows
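The claim that adapters are a small fraction of checkpoint size holds up to a quick calculation. The sketch below estimates LoRA adapter size from the standard LoRA shapes (each adapted weight gains an A of shape d_in x r and a B of shape r x d_out). The layer count, hidden size, and attention-only targets are illustrative 70B-class values, and every projection is treated as square, which overestimates the k/v projections under grouped-query attention.

```python
def lora_adapter_size_mb(num_layers: int, hidden: int, rank: int,
                         matrices_per_layer: int, bytes_per_param: int = 2) -> float:
    """Each adapted (hidden x hidden) weight adds A (hidden x r) and B (r x hidden)."""
    params_per_matrix = 2 * hidden * rank
    total_params = num_layers * matrices_per_layer * params_per_matrix
    return total_params * bytes_per_param / 1e6

# Illustrative 70B-class shape: 80 layers, hidden size 8192, r=16, adapting the
# four attention projections -> roughly 170 MB of adapters, versus the
# ~60GB-per-checkpoint figure above, which is dominated by the base-model copy.
print(f"{lora_adapter_size_mb(80, 8192, 16, 4):.0f} MB")
```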
Software Stack Notes
For NVIDIA hardware, Hugging Face Transformers + PEFT + bitsandbytes is the canonical QLoRA stack; Unsloth is a popular performance-optimised wrapper offering a 1.5-2x training speedup on consumer GPUs. For Apple Silicon, the MLX framework and its LoRA fine-tuning examples are the primary stack; PyTorch MPS works but is slower.
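As a minimal sketch of what that NVIDIA-side stack looks like in practice (the model ID and LoRA hyperparameters below are illustrative placeholders, not recommendations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

# Quantise the frozen base weights to 4-bit (NF4, as in the QLoRA paper),
# computing in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare the quantised model for training, then attach small LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

From here the model trains like any other Transformers model, e.g. with the Trainer or TRL's SFTTrainer; Unsloth wraps an equivalent setup behind its own loading API.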
Brand Visibility Implications
Fine-tuned brand-aware models are a quietly important AI-visibility surface. Enterprises increasingly fine-tune open-weight base models (Llama 4, Qwen 3, Mistral) on internal documentation, product information, and customer-support transcripts. The resulting models shape how employees use AI internally and, increasingly, how customer-facing chatbots behave. Where these fine-tunes happen on local DGX Spark or workstation clusters, the resulting brand-recommendation behaviour is invisible to cloud-API monitoring. As QLoRA on 70B models becomes feasible on a single workstation, this surface is growing fast.
Methodology
Memory multipliers are rule-of-thumb estimates from the Hugging Face PEFT documentation and the original QLoRA paper (Dettmers et al., 2023). Wall-clock figures are aggregated from public training runs reported in the Unsloth GitHub and MLX Examples discussions. Real runs vary with sequence length, batch size, optimiser, and gradient-checkpointing settings; treat these figures as guidance, not a guarantee. Updated quarterly.
How Presenc AI Helps
Presenc AI tracks brand visibility on enterprise fine-tuned LLM deployments through deployment-side instrumentation, the only visibility available for fine-tuned models that never hit a cloud API. For enterprises shipping internal copilots or customer-facing chatbots on fine-tuned open-weight models, this is the operational answer to "what brands does our model recommend?"