How Fast Does Qwen 3.5 Actually Run at Each Quantization?
Qwen 3.5 (Alibaba) is the most-downloaded open-weight LLM family on Hugging Face in 2026, and most production deployments run quantized variants to save memory and gain speed. This page consolidates tokens-per-second (tps) benchmarks for Qwen 3.5 across the GGUF quantization tiers Q2 through Q8, with FP16 and AWQ 4-bit as reference points. It covers three reference hardware classes: Apple Silicon M5 Max and NVIDIA RTX 3090/4090 (tps tables), and consumer-grade 16GB systems (memory-footprint table). Quality tradeoffs by quantization are covered on our companion Local LLM Quantization Quality Benchmarks page.
Tokens-per-Second by Quantization (Qwen 3.5 32B on Apple M5 Max, llama.cpp)
| Quantization | Bits | Memory | tps (10K context) | tps (120K context) |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~64 GB | ~18 tps | OOM or fallback |
| Q8_0 | 8 | ~34 GB | ~52 tps | ~38 tps |
| Q6_K | 6.5 | ~27 GB | ~78 tps | ~56 tps |
| Q5_K_M | 5.5 | ~23 GB | ~95 tps | ~65 tps |
| UD-Q5_K_XL (Unsloth) | 5.5 | ~24 GB | ~75 tps | ~65 tps |
| Q4_K_M | 4.5 | ~19 GB | ~115 tps | ~78 tps |
| AWQ 4-bit | 4.0 | ~18 GB | ~120 tps | ~82 tps |
| Q3_K_M | 3.5 | ~14 GB | ~136 tps | ~92 tps |
| Q2_K | 2.5 | ~10 GB | ~152 tps | ~100 tps |
Pattern: tps rises roughly 2.6x from Q8_0 to Q3_K_M at short context (~52 to ~136 tps), and the context-length penalty (10K vs 120K) reduces tps by about 27 to 34 percent on every tier except UD-Q5_K_XL, which loses only ~13 percent. The Q4 tier hits the practical sweet spot: ~115 tps without the quality degradation that begins below Q4.
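The context-length penalty can be recomputed directly from the table rows. The sketch below copies the tps pairs from the M5 Max table; the `context_penalty` helper is an illustrative name, not part of any benchmark tool.

```python
# (tps at 10K context, tps at 120K context), copied from the M5 Max table
TPS = {
    "Q8_0": (52, 38),
    "Q6_K": (78, 56),
    "Q5_K_M": (95, 65),
    "UD-Q5_K_XL": (75, 65),
    "Q4_K_M": (115, 78),
    "AWQ 4-bit": (120, 82),
    "Q3_K_M": (136, 92),
    "Q2_K": (152, 100),
}


def context_penalty(short_tps, long_tps):
    """Fraction of tps lost when moving from 10K to 120K context."""
    return 1 - long_tps / short_tps


penalties = {name: context_penalty(*pair) for name, pair in TPS.items()}

for name, penalty in penalties.items():
    print(f"{name:12s} {penalty:.0%}")

# Short-context speedup from the slowest to a fast GGUF tier
q8_to_q3_speedup = TPS["Q3_K_M"][0] / TPS["Q8_0"][0]
print(f"Q8_0 -> Q3_K_M speedup: {q8_to_q3_speedup:.1f}x")
```

Running this on the tabulated figures puts most tiers between 27 and 34 percent, with UD-Q5_K_XL as the outlier at about 13 percent.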
Tokens-per-Second on NVIDIA Hardware
| Quantization | RTX 3090 (24GB) | RTX 4090 (24GB) |
|---|---|---|
| Q8_0 | ~75 tps | ~110 tps |
| Q5_K_M | ~120 tps | ~165 tps |
| Q4_K_M | ~140 tps | ~190 tps |
| AWQ 4-bit | ~155 tps | ~210 tps |
| Q3_K_M (Unsloth) | ~165 tps | ~230 tps |
Memory Footprint (Qwen 3.5 Family, Q4_K_M reference)
| Model | Parameters | Q4_K_M Memory | Recommended Hardware |
|---|---|---|---|
| Qwen 3.5 0.5B Instruct | 0.5B | ~0.5 GB | Any mobile / Raspberry Pi class |
| Qwen 3.5 1.5B Instruct | 1.5B | ~1.2 GB | 4GB RAM phone / Pi |
| Qwen 3.5 3B Instruct | 3B | ~2.3 GB | 8GB consumer laptop |
| Qwen 3.5 7B Instruct | 7B | ~4.5 GB | 16GB consumer laptop / M-series Mac |
| Qwen 3.5 14B | 14B | ~9 GB | 16-24 GB workstation |
| Qwen 3.5 32B | 32B | ~19 GB | 24-32 GB GPU / M2 Max+ |
| Qwen 3.5 Max (MoE, 235B params, ~22B active) | 235B nominal | ~140 GB | Multi-GPU / H100 / M3 Ultra |
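The memory figures in these tables track a simple weights-only estimate: parameters times bits per weight, divided by 8. A minimal sketch follows, using the nominal bit widths from the first table; `estimate_weights_gb` is a hypothetical helper, and the assumption that the remaining gap is runtime overhead (buffers, context cache, higher-precision layers) is ours, not something the sources break down.

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Weights-only size in GB (1 GB = 10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


# (model, parameters in billions, tabulated Q4_K_M memory in GB)
FAMILY = [
    ("Qwen 3.5 7B", 7, 4.5),
    ("Qwen 3.5 14B", 14, 9),
    ("Qwen 3.5 32B", 32, 19),
    ("Qwen 3.5 Max", 235, 140),
]

Q4_K_M_BITS = 4.5  # nominal bits per weight from the quantization table

for model, params, tabulated in FAMILY:
    estimate = estimate_weights_gb(params, Q4_K_M_BITS)
    # The residual is what the table adds on top of raw weights
    # (assumed here to be runtime overhead).
    print(f"{model:14s} estimate {estimate:6.1f} GB, table ~{tabulated} GB, "
          f"gap {tabulated - estimate:.1f} GB")
```

For the 32B model this gives 18 GB of weights against the tabulated ~19 GB, so the estimate is a reasonable lower bound when sizing hardware for a tier the tables do not list.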
Five Things the Benchmarks Tell You
- Q4_K_M is the production sweet spot for Qwen 3.5 32B. ~115 tps on M5 Max with ~19 GB memory and 1-2 percent quality degradation versus FP16. Below Q4 (Q3, Q2), tps continues to climb but reasoning benchmarks degrade meaningfully per our quality-benchmarks page.
- Context length costs roughly 30 percent of tps. The drop from 10K to 120K context is 27 to 34 percent on every tier in the M5 Max table except UD-Q5_K_XL (~13 percent). Plan capacity around long-context tps, not short-context tps, for retrieval-augmented or agentic workloads.
- Apple M5 Max competes with desktop NVIDIA at Q4. M5 Max Q4_K_M at ~115 tps is about 18 percent slower than RTX 3090 Q4_K_M at ~140 tps, even though the 3090 is a dedicated 350W card and the M5 Max is a ~80W unified-memory system-on-chip. For Mac-native developers, the practical performance gap is smaller than the price gap.
- UD-Q5_K_XL (Unsloth's extended Q5) does not consistently beat Q5_K_M. The Unsloth variant uses slightly more memory (~24 GB vs ~23 GB) and runs slower at short context (~75 tps vs ~95 tps), trading speed for marginally better quality on certain prompts. Worth testing in your specific workload but not the default choice.
- The Q3 quality wall is sharper for Qwen 3.5 than for Llama. Qwen 3.5 32B at Q3_K_M loses noticeable accuracy on GSM8K and HumanEval, whereas Llama 4 8B at the same bit-width holds up better. The implication: for Qwen 3.5 specifically, do not go below Q4 for agent or tool-use workloads, even though the speed advantage is tempting.
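The guidance in this list (stay at Q4 or above, plan around long-context tps) can be sketched as a small selection rule for Qwen 3.5 32B. The `pick_tier` function and its `min_bits` floor are hypothetical names for illustration, and the data covers only the standard GGUF tiers from the M5 Max table (AWQ and the Unsloth variant are omitted for brevity).

```python
# (name, nominal bits, memory in GB, tps at 120K context) from the M5 Max table
TIERS = [
    ("Q8_0", 8.0, 34, 38),
    ("Q6_K", 6.5, 27, 56),
    ("Q5_K_M", 5.5, 23, 65),
    ("Q4_K_M", 4.5, 19, 78),
    ("Q3_K_M", 3.5, 14, 92),
    ("Q2_K", 2.5, 10, 100),
]


def pick_tier(budget_gb, min_bits=4.5):
    """Return the fastest long-context tier that fits the memory budget
    and stays at or above the quality floor, or None if nothing fits."""
    fits = [t for t in TIERS if t[2] <= budget_gb and t[1] >= min_bits]
    if not fits:
        return None
    return max(fits, key=lambda t: t[3])[0]


print(pick_tier(24))              # 24 GB budget with the Q4 floor
print(pick_tier(16))              # 16 GB budget: nothing at Q4 or above fits
print(pick_tier(16, min_bits=0))  # same budget with the floor removed
```

With the Q4 floor in place, a 16 GB budget returns no tier for the 32B model, which suggests stepping down to a smaller family member at Q4_K_M (see the memory-footprint table) before dropping the 32B model below Q4.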
What This Means for AI Visibility
Qwen 3.5 is increasingly the substrate of local-deployment agents and RAG systems, particularly in Chinese-language and bilingual workloads. Brand-visibility programmes that test only against cloud APIs and ignore local Qwen deployments miss a large and growing surface, especially in APAC markets. The quantization tier matters because lower-bit Qwen variants degrade reasoning quality faster, which changes how the model handles brand-comparison prompts. For brands testing AI visibility across deployment scenarios, Q4_K_M is the right default test target.
Methodology
Benchmark data was aggregated on May 14, 2026 from public sources: Qwen official speed benchmarks, Unsloth Qwen 3.5 documentation, community llama.cpp benchmarking threads on GitHub, and reproducible community benchmarks on Apple Silicon (including Ollama vs llama.cpp vs MLX comparisons). All figures are llama.cpp 0.39+ with batch size 1. Real-world tps varies with prompt characteristics, so treat the tabulated figures as guidance and run benchmarks on your own hardware-prompt combination for production sizing.
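For running your own measurements, a minimal timing harness might look like the sketch below. It assumes your runtime exposes some streaming call that yields one token at a time; `measure_tps` and `fake_generate` are hypothetical names, and the fake generator only stands in for a real model so the example runs on its own.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_tps(generate: Callable[[str], Iterable[str]],
                prompt: str, runs: int = 3) -> float:
    """Median tokens-per-second over several single-request runs.

    Timing is end to end, so prompt processing is included in the figure.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = sum(1 for _ in generate(prompt))
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)


def fake_generate(prompt: str):
    """Stand-in for a real streaming call: 50 tokens at roughly 1 ms each."""
    for i in range(50):
        time.sleep(0.001)
        yield f"tok{i}"


tps = measure_tps(fake_generate, "Compare brand A and brand B.")
print(f"{tps:.0f} tps")
```

Swapping `fake_generate` for your runtime's streaming function, and repeating the run at both short and long prompts, gives figures comparable in spirit to the two context columns above.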
How Presenc AI Helps
Presenc AI tracks brand-mention rates across Qwen-powered deployments alongside cloud-API surfaces. When a brand's mention rate diverges between cloud Qwen and local Qwen at Q4, the gap often signals that quantization is dropping low-confidence brand recall. For brands with substantial APAC or developer-tooling exposure, the local-Qwen surface is now a meaningful component of total brand visibility.