Why Quantization Is the Hidden Variable in Local LLM Deployment
Quantization makes local LLMs practical: it shrinks memory, accelerates inference, and trades a small quality cost for large efficiency gains. But the quality cost depends on which quantization format, which bit-width, and which model. This page consolidates published quality benchmarks (perplexity, MMLU, HumanEval, GSM8K) for the four formats developers actually deploy in 2026: GGUF (llama.cpp), MLX (Apple Silicon), AWQ, and GPTQ.
Key Findings
- Q4 quantization (4-bit weights, mixed-precision activations) typically degrades perplexity by 1-3 percent versus FP16, making it the practical production sweet spot.
- Q3 quantization shows meaningful quality loss (3-8 percent perplexity degradation) and visible regression on reasoning benchmarks (GSM8K, HumanEval).
- Q5 and Q6 quantizations offer minimal quality improvement over Q4 (under 1 percent perplexity recovery) at a meaningful memory cost; they are rarely worth it.
- AWQ outperforms GPTQ on most modern models (Llama 3+, Qwen 2+) by approximately 0.5-1.0 percent perplexity at the same bit-width, due to activation-aware scaling.
- MLX 4-bit quantization on Apple Silicon delivers quality comparable to GGUF Q4 at slightly faster inference, but with a smaller community ecosystem (fewer pre-quantized model variants).
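To build intuition for why quality loss accelerates at lower bit-widths, the round-trip error of quantization can be measured directly. The sketch below implements simple symmetric block-wise quantization; this is an illustrative toy, not the actual GGUF K-quant or AWQ scheme (which add per-block minimums, activation-aware scaling, and other refinements), but it shows the same trend: each bit removed roughly doubles the reconstruction error.

```python
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int, block: int = 32) -> np.ndarray:
    """Symmetric block-wise round-trip: quantize weights to `bits`, dequantize back."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # guard against all-zero blocks
    q = np.clip(np.round(w / scale), -qmax, qmax)   # integer codes
    return (q * scale).reshape(weights.shape)       # back to float

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
for bits in (8, 5, 4, 3, 2):
    err = np.abs(quantize_dequantize(w, bits) - w).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The 8-to-4-bit step is cheap in error terms while halving memory, which is why Q4 is the sweet spot; below 4 bits the error curve steepens sharply, matching the Q3/Q2 benchmark cliffs above.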
Perplexity Degradation by Quantization (Wikitext-2, lower is better)
| Quantization | Bits | Llama 4 8B | Llama 4 70B | Qwen 3 32B |
|---|---|---|---|---|
| FP16 | 16 | 5.21 | 3.42 | 4.78 |
| Q8_0 (GGUF) | 8 | 5.22 (+0.2%) | 3.42 (+0.0%) | 4.79 (+0.2%) |
| Q6_K (GGUF) | 6.5 | 5.24 (+0.6%) | 3.43 (+0.3%) | 4.80 (+0.4%) |
| Q5_K_M (GGUF) | 5.5 | 5.27 (+1.2%) | 3.45 (+0.9%) | 4.83 (+1.0%) |
| Q4_K_M (GGUF) | 4.5 | 5.31 (+1.9%) | 3.47 (+1.5%) | 4.87 (+1.9%) |
| AWQ 4-bit | 4.0 | 5.30 (+1.7%) | 3.46 (+1.2%) | 4.85 (+1.5%) |
| GPTQ 4-bit | 4.0 | 5.34 (+2.5%) | 3.49 (+2.0%) | 4.89 (+2.3%) |
| Q3_K_M (GGUF) | 3.5 | 5.46 (+4.8%) | 3.55 (+3.8%) | 5.02 (+5.0%) |
| Q2_K (GGUF) | 2.5 | 5.89 (+13%) | 3.78 (+11%) | 5.45 (+14%) |
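The percentage columns above are relative perplexity increases over the FP16 baseline. A minimal helper makes the computation explicit, using the Llama 4 8B column as an example (the `FP16_PPL` constant is just that column's baseline value):

```python
FP16_PPL = 5.21  # Llama 4 8B FP16 baseline from the table

def ppl_degradation(ppl: float, baseline: float = FP16_PPL) -> float:
    """Percent increase in perplexity over the FP16 baseline."""
    return 100.0 * (ppl - baseline) / baseline

for name, ppl in [("Q8_0", 5.22), ("Q4_K_M", 5.31), ("Q3_K_M", 5.46), ("Q2_K", 5.89)]:
    print(f"{name}: +{ppl_degradation(ppl):.1f}%")
```

Because perplexity is already exponential in loss, even a 2 percent increase corresponds to a measurable shift in the model's output distribution; percentages should be compared within a model, not across models.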
Reasoning Benchmark Degradation (GSM8K math, accuracy)
| Quantization | Llama 4 8B | Qwen 3 32B | Drop vs FP16 |
|---|---|---|---|
| FP16 | 78.2 | 89.1 | baseline |
| Q8_0 | 78.0 | 89.0 | ~0% |
| Q5_K_M | 77.4 | 88.6 | ~0.5% |
| Q4_K_M | 76.5 | 87.9 | ~1.5% |
| AWQ 4-bit | 76.8 | 88.0 | ~1.3% |
| Q3_K_M | 72.1 | 84.8 | ~5% |
| Q2_K | 61.4 | 76.5 | ~14% |
Reasoning benchmarks degrade faster than perplexity alone suggests: at Q3_K_M, Llama 4 8B loses over six GSM8K points while its perplexity rises under 5 percent. For agent and tool-use workloads, do not go below Q4.
Speed and Memory Savings
| Quantization | Memory (Llama 70B weights) | Throughput speedup vs FP16 (tokens/s, M5 Max) |
|---|---|---|
| FP16 | ~140GB | 1.0x (baseline, often OOM) |
| Q8_0 | ~70GB | ~1.8x |
| Q5_K_M | ~48GB | ~3.0x |
| Q4_K_M | ~40GB | ~3.7x |
| AWQ 4-bit | ~38GB | ~3.8x |
| Q3_K_M | ~30GB | ~4.5x |
Format Recommendations by Use Case
- Production agents and tool-use: AWQ 4-bit or GGUF Q5_K_M. Reasoning quality matters, and the extra memory over Q4_K_M (roughly 20 percent for Q5_K_M on a 70B model) is worth it.
- Single-user chat: GGUF Q4_K_M. Best speed/quality/portability balance, widely available.
- Apple Silicon: MLX 4-bit if available; fall back to GGUF Q4_K_M.
- Memory-constrained (consumer 8-16GB): Q3_K_M for non-reasoning tasks only; never below Q3.
- Quality-critical evaluation: Q8_0 or FP16; the 2x memory cost is small at this end of the spectrum.
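The recommendations above can be encoded as a simple picker. This is a hypothetical helper, not part of any real tool; the use-case names and return strings are illustrative shorthand for the bullets above.

```python
def pick_quant(use_case: str, apple_silicon: bool = False) -> str:
    """Map a use case to the recommended quantization tier (illustrative)."""
    if use_case == "agent":        # production agents / tool use: reasoning matters
        return "AWQ 4-bit or GGUF Q5_K_M"
    if use_case == "eval":         # quality-critical evaluation
        return "Q8_0 or FP16"
    if use_case == "low-memory":   # consumer 8-16GB, non-reasoning tasks only
        return "Q3_K_M"
    if apple_silicon:              # single-user chat on Apple Silicon
        return "MLX 4-bit (fallback: GGUF Q4_K_M)"
    return "GGUF Q4_K_M"           # default: single-user chat
```

A picker like this belongs in deployment config rather than code, but the branch order captures the key priority: workload type first, hardware second.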
Brand Visibility Implications
Quantization affects what models recommend at the margin. Reasoning benchmarks degrade meaningfully at Q3 and below, and reasoning models are increasingly making brand-recommendation decisions in agent flows. Brands evaluating their AI visibility should be aware that the same model run at Q4 versus Q2 may produce materially different brand inclusions because the model's ability to retrieve and reason over training data degrades at low quantization. For agent-driven local deployments, Q4 minimum is the operational requirement.
Methodology
Perplexity figures are aggregated from llama.cpp Discussions quality threads, the AutoAWQ repo, and Hugging Face quantization blog posts. GSM8K accuracy comes from public model cards on Hugging Face for representative quantized variants. Memory and speedup figures come from MLX and llama.cpp benchmark threads. Real runs vary; treat these numbers as guidance. Updated quarterly.
How Presenc AI Helps
Presenc AI's deployment-side telemetry distinguishes brand-mention rates across quantization tiers, surfacing whether a brand's recommendation rate degrades when a deployment moves from Q5 to Q3 to save memory. For enterprises optimizing local LLM deployments, this is the only operational signal that connects quantization decisions to brand-visibility outcomes.