Quantization is the dominant cost-reduction technique for LLM inference in 2026. The major formats include GGUF (llama.cpp), AWQ (Activation-Aware Weight Quantization), GPTQ (Generative Pre-trained Transformer Quantization), EXL2 (ExLlamaV2), MLX (Apple Silicon), FP8 (H100 / H200 / B200 native), NF4 (NormalFloat-4), and INT4 / INT8. Each format makes different tradeoffs among quality, throughput, and toolchain support. This page consolidates the comparison and the deployment guidance.
Key Findings
- FP8 is the dominant 2026 format on H100, H200, B100, and B200 hardware: native hardware support lets FP8 combine near-lossless quality with the highest throughput of the formats compared here.
- GGUF (llama.cpp) dominates consumer and on-device deployment because of broad hardware support (CPU, Apple Silicon, NVIDIA, AMD, Intel) and stable tooling.
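- AWQ has emerged as the dominant server-side production format for GPU inference where FP8 hardware support is unavailable (A100) or where 4-bit weights are needed to fit limited VRAM (L40S at 48 GB, RTX 4090 at 24 GB; both are Ada-generation parts that do have FP8 tensor cores).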
- MLX is the dominant format on Apple Silicon: the MLX-LM ecosystem delivers native performance well above GGUF on the same hardware.
- Quality ranking at 4-bit, from least to most degradation: FP4 native (B200), AWQ, GPTQ, EXL2 (with mixed bits), Q4_K_M GGUF, NF4, naive INT4.
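To illustrate why naive INT4 sits at the bottom of that ranking, here is a minimal sketch of symmetric absmax round-to-nearest quantization applied to one small group of weights. The weight values and the function names are made up for illustration; real toolchains use larger groups and, in the case of AWQ and GPTQ, calibration data to reduce the error shown here.

```python
def quantize_group(weights, bits=4):
    """Symmetric absmax round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest integer code and clamp to the signed range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits=4):
    """Largest per-weight reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical group: seven small weights and one outlier (-0.70).
group = [0.02, -0.01, 0.03, -0.70, 0.015, 0.005, -0.025, 0.01]

err4 = max_abs_error(group, bits=4)
err8 = max_abs_error(group, bits=8)
print(f"4-bit max error: {err4:.4f}")
print(f"8-bit max error: {err8:.4f}")
```

In this toy group the single outlier stretches the 4-bit scale so far that every small weight rounds to zero, while 8 bits keeps them distinguishable. Calibrated methods exist largely to avoid this failure mode.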
Quantization Format Comparison (May 2026)
| Format | Tooling | Hardware | Typical Quality (vs FP16) |
|---|---|---|---|
| FP8 (e4m3 / e5m2) | vLLM, TensorRT-LLM, TGI | H100 / H200 / B200 native | ~99.7% (near-lossless) |
| FP4 (Blackwell) | TensorRT-LLM | B100 / B200 native | ~99.0% |
| AWQ (4-bit) | vLLM, AutoAWQ, MLC-LLM | NVIDIA GPU, AMD GPU | ~98-99% |
| GPTQ (4-bit) | vLLM, AutoGPTQ, TGI | NVIDIA GPU, AMD GPU | ~97-99% |
| EXL2 (mixed bits 2.5-8) | ExLlamaV2, TabbyAPI | NVIDIA GPU | ~98-99% at 5+ bpw |
| GGUF Q4_K_M (4-bit) | llama.cpp, Ollama, LM Studio | CPU, Apple Silicon, all GPUs | ~96-98% |
| GGUF Q5_K_M (5-bit) | llama.cpp | CPU, Apple Silicon, all GPUs | ~98-99% |
| GGUF Q8_0 (8-bit) | llama.cpp | CPU, Apple Silicon, all GPUs | ~99.7% |
| MLX 4-bit | MLX-LM | Apple Silicon | ~97-98% |
| MLX 8-bit | MLX-LM | Apple Silicon | ~99.5% |
| NF4 (BitsAndBytes) | Hugging Face Transformers | NVIDIA GPU | ~95-97% |
| INT8 | vLLM, TensorRT-LLM | NVIDIA GPU (broad support) | ~99-99.5% |
| SmoothQuant W8A8 | vLLM, TensorRT-LLM | NVIDIA GPU | ~99% |
Throughput Comparison (Llama 3.1 70B on Single H100)
| Format | VRAM Usage | Tokens/sec (batch=1) |
|---|---|---|
| FP16 | ~140 GB (does not fit single H100) | n/a |
| FP8 (e4m3) | ~70 GB | ~52 |
| AWQ 4-bit | ~35 GB | ~38 |
| GPTQ 4-bit | ~35 GB | ~32 |
| GGUF Q4_K_M | ~38 GB | ~28 |
| NF4 | ~35 GB | ~24 |
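The VRAM column follows closely from parameter count times bits per weight. The sketch below reproduces that arithmetic for a 70B-parameter model; it counts weights only (no KV cache or activations), treats 1 GB as 10^9 bytes, and assumes an 80 GB H100. The helper names are illustrative.

```python
H100_VRAM_GB = 80  # assumed capacity of a single H100


def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def effective_bpw(size_gb, params_billion):
    """Back out the effective bits per weight from a reported footprint."""
    return size_gb * 8 / params_billion


for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(70, bits)
    verdict = "fits" if gb < H100_VRAM_GB else "does not fit"
    print(f"{name}: ~{gb:.0f} GB ({verdict} on one {H100_VRAM_GB} GB H100)")

# The ~38 GB listed for GGUF Q4_K_M implies slightly more than 4 bits per weight.
print(f"Q4_K_M effective bpw: {effective_bpw(38, 70):.2f}")
```

The GGUF Q4_K_M row comes out above the flat 4-bit figure because K-quants store per-block scales and keep some tensors at higher precision, so the effective bits per weight exceed 4.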
Use Case Recommendations
| Use Case | Recommended Format |
|---|---|
| H100 / H200 production server | FP8 via vLLM or TensorRT-LLM |
| B100 / B200 frontier deployment | FP4 via TensorRT-LLM |
| A100 / L40S production server | AWQ 4-bit via vLLM |
| RTX 4090 / consumer GPU | AWQ 4-bit or GGUF Q4_K_M |
| Apple Silicon (Mac) | MLX 4-bit or GGUF Q4_K_M |
| CPU-only deployment | GGUF Q4_K_M or Q5_K_M via llama.cpp |
| Maximum quality (FP16 fits) | FP16 or BF16 |
| Fine-tuning | Avoid quantization or use QLoRA NF4 |
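For teams that want to encode the hardware rows of this table in deployment tooling, a small lookup is usually enough. The following is a sketch only: the dictionary keys are hypothetical short labels chosen for illustration, and the values are copied from the table above.

```python
# Keys are hypothetical labels; values are the recommendations from the table.
RECOMMENDED_FORMAT = {
    "h100": "FP8 via vLLM or TensorRT-LLM",
    "h200": "FP8 via vLLM or TensorRT-LLM",
    "b100": "FP4 via TensorRT-LLM",
    "b200": "FP4 via TensorRT-LLM",
    "a100": "AWQ 4-bit via vLLM",
    "l40s": "AWQ 4-bit via vLLM",
    "rtx 4090": "AWQ 4-bit or GGUF Q4_K_M",
    "apple silicon": "MLX 4-bit or GGUF Q4_K_M",
    "cpu": "GGUF Q4_K_M or Q5_K_M via llama.cpp",
}


def recommend_format(hardware: str) -> str:
    """Return the table's recommendation for a hardware label."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(hardware.lower().split())
    if key not in RECOMMENDED_FORMAT:
        raise KeyError(f"no recommendation recorded for {hardware!r}")
    return RECOMMENDED_FORMAT[key]


print(recommend_format("A100"))
print(recommend_format("Apple Silicon"))
```

Raising on unknown hardware, rather than falling back to a default, keeps an unlisted accelerator from silently inheriting a format it may not support.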
Quality Degradation by Model Size
Quality loss from quantization shrinks as model size grows. Models above 30B parameters tolerate 4-bit quantization with under 2 percent quality degradation on most benchmarks. Models below 7B parameters can lose 5 to 10 percent quality at 4-bit and often require 5-bit or 6-bit quantization for production use. Mixture-of-experts (MoE) models behave differently: total parameter count, not active parameter count, drives quantization robustness, so MoE models with 200B+ total parameters tolerate aggressive 4-bit quantization even when the active parameter count is relatively small.
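The size thresholds above can be expressed as a rough starting-point heuristic. This is a sketch, not a rule: the function name is illustrative, and the handling of the 7B to 30B band is an assumption, since the text gives no figure for it.

```python
def suggested_min_bits(total_params_b: float) -> tuple[int, str]:
    """Suggest a starting bit width from total parameter count (in billions).

    For MoE models, pass the TOTAL parameter count, not the active count.
    """
    if total_params_b > 30:
        return 4, "4-bit typically loses under 2 percent on most benchmarks"
    if total_params_b < 7:
        return 5, "4-bit can lose 5 to 10 percent; start at 5 or 6 bits"
    # Assumption: no figure is given for this band, so flag it for testing.
    return 4, "no figure given for 7B to 30B; benchmark before committing"


# Example sizes; the MoE entry is hypothetical (235B total parameters).
for label, total_b in [("3B dense", 3), ("13B dense", 13),
                       ("70B dense", 70), ("hypothetical MoE", 235)]:
    bits, note = suggested_min_bits(total_b)
    print(f"{label}: start at {bits}-bit ({note})")
```

Whatever the heuristic suggests, the final choice should rest on evaluation against the workload's own benchmarks.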
Brand Visibility Implications
Quantization format selection is a high-traffic technical procurement decision. Queries to AI assistants about "GGUF vs AWQ", "FP8 quantization", "best quantization for production", and similar terms feed directly into production decisions. Brands selling inference infrastructure, model serving, and quantization tooling therefore have a large AI-mediated discovery surface in this category.
Methodology
Quality and throughput data were compiled from llama.cpp, vLLM, and primary tooling documentation through 23 May 2026. Throughput figures are for Llama 3.1 70B at batch size 1 with a 2k-token context. Updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on quantization queries across ChatGPT, Claude, Gemini, and Perplexity. For inference infrastructure brands, model serving platforms, and quantization tooling vendors, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.