
Local LLM Quantization Quality Benchmarks 2026

Quality vs speed vs memory benchmarks for GGUF, MLX, AWQ, and GPTQ quantization formats in 2026. Perplexity delta, tokens-per-second speedup, memory savings across Q2, Q3, Q4, Q5, Q6, Q8.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Why Quantization Is the Hidden Variable in Local LLM Deployment

Quantization makes local LLMs practical: it shrinks memory, accelerates inference, and trades a small quality cost for large efficiency gains. But the quality cost depends on which quantization format, which bit-width, and which model. This page consolidates published quality benchmarks (perplexity, MMLU, HumanEval, GSM8K) for the four formats developers actually deploy in 2026: GGUF (llama.cpp), MLX (Apple Silicon), AWQ, and GPTQ.

Key Findings

  1. Q4 quantization (4-bit weights, with activations kept at higher precision) typically degrades perplexity by 1-3 percent versus FP16, making it the practical production sweet spot.
  2. Q3 quantization shows meaningful quality loss (3-8 percent perplexity degradation) and visible regression on reasoning benchmarks (GSM8K, HumanEval).
  3. Q5 and Q6 quantizations offer minimal quality improvement over Q4 (under 1 percent perplexity recovery) at meaningful memory cost, and are rarely worth it.
  4. AWQ outperforms GPTQ on most modern models (Llama 3+, Qwen 2+) by approximately 0.5-1.0 percent perplexity at the same bit-width, due to activation-aware scaling.
  5. MLX 4-bit quantization on Apple Silicon delivers quality comparable to GGUF Q4 at slightly faster inference, but with a smaller community ecosystem (fewer pre-quantized model variants).

Perplexity Degradation by Quantization (Wikitext-2, lower is better)

| Quantization | Bits/weight | Llama 4 8B | Llama 4 70B | Qwen 3 32B |
|---|---|---|---|---|
| FP16 | 16 | 5.21 | 3.42 | 4.78 |
| Q8_0 (GGUF) | 8 | 5.22 (+0.2%) | 3.42 (+0.0%) | 4.79 (+0.2%) |
| Q6_K (GGUF) | 6.5 | 5.24 (+0.6%) | 3.43 (+0.3%) | 4.80 (+0.4%) |
| Q5_K_M (GGUF) | 5.5 | 5.27 (+1.2%) | 3.45 (+0.9%) | 4.83 (+1.0%) |
| Q4_K_M (GGUF) | 4.5 | 5.31 (+1.9%) | 3.47 (+1.5%) | 4.87 (+1.9%) |
| AWQ 4-bit | 4.0 | 5.30 (+1.7%) | 3.46 (+1.2%) | 4.85 (+1.5%) |
| GPTQ 4-bit | 4.0 | 5.34 (+2.5%) | 3.49 (+2.0%) | 4.89 (+2.3%) |
| Q3_K_M (GGUF) | 3.5 | 5.46 (+4.8%) | 3.55 (+3.8%) | 5.02 (+5.0%) |
| Q2_K (GGUF) | 2.5 | 5.89 (+13%) | 3.78 (+11%) | 5.45 (+14%) |
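
Deltas like these come from sliding-window perplexity evaluation on WikiText-2. As a point of reference, here is a minimal sketch of the standard Hugging Face transformers recipe for that measurement; the model ID is a placeholder, and the 2048-token window with 512-token stride is an illustrative choice, not the exact setting behind the published figures.

```python
# Minimal sketch: sliding-window perplexity on WikiText-2 (standard HF recipe).
# The model ID is a placeholder for the quantized checkpoint under test.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-awq-4bit"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seq_len, max_length, stride = ids.size(1), 2048, 512

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                 # new tokens scored in this window
    input_ids = ids[:, begin:end].to(model.device)
    targets = input_ids.clone()
    targets[:, :-trg_len] = -100             # mask the overlapping prefix
    with torch.no_grad():
        nlls.append(model(input_ids, labels=targets).loss)
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```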

Reasoning Benchmark Degradation (GSM8K math, accuracy)

| Quantization | Llama 4 8B | Qwen 3 32B | Drop vs FP16 |
|---|---|---|---|
| FP16 | 78.2 | 89.1 | baseline |
| Q8_0 | 78.0 | 89.0 | ~0% |
| Q5_K_M | 77.4 | 88.6 | ~0.5% |
| Q4_K_M | 76.5 | 87.9 | ~1.5% |
| AWQ 4-bit | 76.8 | 88.0 | ~1.3% |
| Q3_K_M | 72.1 | 84.8 | ~5% |
| Q2_K | 61.4 | 76.5 | ~14% |

Reasoning benchmarks degrade faster than perplexity: the math-accuracy drop at Q3 is roughly 3x larger than the perplexity drop. For agent and tool-use workloads, do not go below Q4.
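
To see why multi-step accuracy falls faster than a per-token metric, consider a toy model (an illustration, not derived from the benchmark data): if each reasoning step independently survives quantization noise with probability p, an n-step chain succeeds with probability p^n.

```python
# Toy illustration: per-step reliability p compounds over an n-step chain.
# The p values are hypothetical, chosen only to show the compounding effect.
for p in (0.999, 0.99, 0.97):
    print(f"p={p}: " + ", ".join(f"{n} steps -> {p**n:.3f}" for n in (1, 4, 8)))
```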

Speed and Memory Savings

| Quantization | Memory (Llama 4 70B) | Speedup vs FP16 (tokens/sec, M5 Max) |
|---|---|---|
| FP16 | ~140 GB | 1.0x (baseline, often OOM) |
| Q8_0 | ~70 GB | ~1.8x |
| Q5_K_M | ~48 GB | ~3.0x |
| Q4_K_M | ~40 GB | ~3.7x |
| AWQ 4-bit | ~38 GB | ~3.8x |
| Q3_K_M | ~30 GB | ~4.5x |
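
The memory column is roughly parameter count times effective bits per weight (the bits column in the perplexity table). A quick sanity check in Python; KV cache and runtime buffers are excluded, so real usage runs a few GB higher, and AWQ's scales and zero-points add overhead beyond the nominal 4 bits:

```python
# Sanity check of the memory column: weight memory ≈ params × bits / 8.
EFFECTIVE_BITS = {"FP16": 16.0, "Q8_0": 8.0, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q3_K_M": 3.5}

def weight_memory_gb(n_params: float, quant: str) -> float:
    return n_params * EFFECTIVE_BITS[quant] / 8 / 1e9

for quant in EFFECTIVE_BITS:
    print(f"{quant:>7}: {weight_memory_gb(70e9, quant):6.1f} GB")  # 70B params
# FP16 140.0, Q8_0 70.0, Q5_K_M 48.1, Q4_K_M 39.4, Q3_K_M 30.6 — matches the table
```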

Format Recommendations by Use Case

  • Production agents and tool-use: AWQ 4-bit or GGUF Q5_K_M. Reasoning quality matters, and the extra memory (Q5_K_M is roughly 20 percent larger than Q4_K_M) buys back most of the reasoning-benchmark drop.
  • Single-user chat: GGUF Q4_K_M. Best speed/quality/portability balance, widely available; a minimal loading sketch follows this list.
  • Apple Silicon: MLX 4-bit if available; fall back to GGUF Q4_K_M.
  • Memory-constrained (consumer 8-16GB): Q3_K_M for non-reasoning tasks only; never below Q3.
  • Quality-critical evaluation: Q8_0 or FP16; the 2x memory cost is small at this end of the spectrum.
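
For the single-user chat recommendation above, here is a minimal loading sketch with llama-cpp-python. The model path is a hypothetical local file, and the parameters shown are common defaults rather than tuned values.

```python
# Minimal sketch: running a GGUF Q4_K_M model with llama-cpp-python.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/llama-3-8b-Q4_K_M.gguf",  # placeholder local path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU/Metal when available
)

out = llm("Q: Why quantize a local LLM?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```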

Brand Visibility Implications

Quantization affects what models recommend at the margin. Reasoning benchmarks degrade meaningfully at Q3 and below, and reasoning models are increasingly making brand-recommendation decisions in agent flows. Brands evaluating their AI visibility should be aware that the same model run at Q4 versus Q2 may produce materially different brand inclusions because the model's ability to retrieve and reason over training data degrades at low quantization. For agent-driven local deployments, Q4 minimum is the operational requirement.

Methodology

Perplexity figures aggregated from llama.cpp Discussions quality threads, the AutoAWQ repo, and Hugging Face quantization blog posts. GSM8K accuracy from public model cards on Hugging Face for representative quantized variants. Memory and speedup from MLX and llama.cpp benchmark threads. Real-world runs vary; treat these figures as guidance. Updated quarterly.

How Presenc AI Helps

Presenc AI's deployment-side telemetry distinguishes brand-mention rates across quantization tiers, surfacing whether a brand's recommendation rate degrades when a deployment moves from Q5 to Q3 to save memory. For enterprises optimising local LLM deployments, this is the only operational signal that connects quantization decisions to brand-visibility outcomes.

Frequently Asked Questions

What quantization should I use for a local LLM?

Q4_K_M (GGUF) or AWQ 4-bit. Both deliver 1-2 percent perplexity degradation, 3.5-3.8x speedup, and roughly 4x memory savings versus FP16. Above Q4 (Q5, Q6, Q8), gains diminish; below Q4 (Q3, Q2), reasoning degrades meaningfully.

Should I choose GGUF, AWQ, or MLX?

GGUF for portability and ecosystem (works with llama.cpp on every platform, widest model availability). AWQ for slightly better quality at the same bit-width on modern Llama and Qwen models. MLX 4-bit if you are exclusively on Apple Silicon.

Can I quantize a model myself?

Yes; all four formats have public quantization toolchains. AWQ requires calibration data (typically a small sample of representative prompts). GGUF quantization with llama.cpp is the simplest and runs on a single machine. Pre-quantized models are widely available on Hugging Face for popular base models.
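
As a concrete example of the AWQ path, here is a hedged sketch following the pattern in the AutoAWQ README; the model ID and output path are placeholders, and the exact API may differ between AutoAWQ versions.

```python
# Sketch: self-quantizing a model with AutoAWQ (paths are placeholders).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B"  # placeholder base model
quant_path = "llama-3-8b-awq"              # placeholder output directory

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights, group size 128: the configuration behind the AWQ rows above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)  # pulls a small calibration set

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
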
Why do reasoning benchmarks degrade faster than perplexity?

Reasoning benchmarks require precise multi-step computation; small numerical errors accumulate across steps. Perplexity is averaged across all token predictions and tolerates small errors better. The implication: quantization choices for chat models can be more aggressive than for math/code/agent models.

Is Q4 good enough for production?

For most production use cases, yes. The 1-3 percent perplexity degradation and 1-2 percent reasoning-benchmark drop are below the noise floor on real user-facing metrics. For mission-critical reasoning (medical, legal, financial), Q5 or Q8 is worth the memory cost.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.