Research

Qwen 3.5 Quantization Speed Benchmarks 2026

Tokens-per-second benchmarks for Qwen 3.5 at every GGUF quantization tier (Q2 through Q8) on Apple Silicon M5 Max, NVIDIA RTX 3090/4090, and consumer hardware. Memory footprint, speed, and context-decay tradeoffs.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

How Fast Does Qwen 3.5 Actually Run at Each Quantization?

Qwen 3.5 (Alibaba) is the most-downloaded open-weight LLM family on Hugging Face in 2026, and most production deployments run quantised variants for memory and speed reasons. This page consolidates tokens-per-second (tps) benchmarks for Qwen 3.5 across the GGUF quantization tiers Q2 through Q8, on three reference hardware classes: Apple Silicon M5 Max, NVIDIA RTX 3090/4090, and consumer-grade 16GB systems. Quality tradeoffs by quantization are covered on our companion Local LLM Quantization Quality Benchmarks page.

Tokens-per-Second by Quantization (Qwen 3.5 32B on Apple M5 Max, llama.cpp)

QuantizationBitsMemorytps (10K context)tps (120K context)
FP16 / BF1616~64 GB~18 tpsOOM or fallback
Q8_08~34 GB~52 tps~38 tps
Q6_K6.5~27 GB~78 tps~56 tps
Q5_K_M5.5~23 GB~95 tps~65 tps
UD-Q5_K_XL (Unsloth)5.5~24 GB~75 tps~65 tps
Q4_K_M4.5~19 GB~115 tps~78 tps
AWQ 4-bit4.0~18 GB~120 tps~82 tps
Q3_K_M3.5~14 GB~136 tps~92 tps
Q2_K2.5~10 GB~152 tps~100 tps

Pattern: tps roughly doubles from Q8 to Q3 at short context, and the context-length penalty (10K vs 120K) reduces tps by ~30 percent across all quantization tiers. The Q4 tier hits the practical sweet spot of speed (~115 tps) without the quality degradation that begins below Q4.

Tokens-per-Second on NVIDIA Hardware

QuantizationRTX 3090 (24GB)RTX 4090 (24GB)
Q8_0~75 tps~110 tps
Q5_K_M~120 tps~165 tps
Q4_K_M~140 tps~190 tps
AWQ 4-bit~155 tps~210 tps
Q3_K_M (Unsloth)~165 tps~230 tps

Memory Footprint (Qwen 3.5 Family, Q4_K_M reference)

ModelParametersQ4_K_M MemoryRecommended Hardware
Qwen 3.5 0.5B Instruct0.5B~0.5 GBAny mobile / Raspberry Pi class
Qwen 3.5 1.5B Instruct1.5B~1.2 GB4GB RAM phone / Pi
Qwen 3.5 3B Instruct3B~2.3 GB8GB consumer laptop
Qwen 3.5 7B Instruct7B~4.5 GB16GB consumer laptop / M-series Mac
Qwen 3.5 14B14B~9 GB16-24 GB workstation
Qwen 3.5 32B32B~19 GB24-32 GB GPU / M2 Max+
Qwen 3.5 Max (MoE, 235B params, ~22B active)235B nominal~140 GBMulti-GPU / H100 / M3 Ultra

Five Things the Benchmarks Tell You

  1. Q4_K_M is the production sweet spot for Qwen 3.5 32B. ~115 tps on M5 Max with ~19 GB memory and 1-2 percent quality degradation versus FP16. Below Q4 (Q3, Q2), tps continues to climb but reasoning benchmarks degrade meaningfully per our quality-benchmarks page.
  2. Context length costs ~30 percent of tps consistently. The drop from 10K context to 120K context is roughly proportional across all quantization tiers. Plan capacity around long-context tps, not short-context tps, for retrieval-augmented or agentic workloads.
  3. Apple M5 Max competes with desktop NVIDIA at Q4. M5 Max Q4_K_M at ~115 tps is within 20 percent of RTX 3090 Q4_K_M at ~140 tps, despite the GPU being a dedicated 350W card vs the M5 Max's ~80W unified-memory system-on-chip. For Mac-native developers, the practical performance gap is smaller than the price gap.
  4. UD-Q5_K_XL (Unsloth's extended Q5) does not consistently beat Q5_K_M. The Unsloth variant uses slightly more memory (~24 GB vs ~23 GB) and runs slower at short context (~75 tps vs ~95 tps), trading speed for marginally better quality on certain prompts. Worth testing in your specific workload but not the default choice.
  5. The Q3 quality wall is sharper for Qwen 3.5 than for Llama. Qwen 3.5 32B at Q3_K_M loses noticeable reasoning accuracy on GSM8K and HumanEval that Llama 4 8B at the same bit-width preserves better. The implication: for Qwen-3.5 specifically, do not go below Q4 for agent or tool-use workloads, even though the speed advantage is tempting.

What This Means for AI Visibility

Qwen 3.5 is increasingly the substrate of local-deployment agents and RAG systems, particularly in Chinese-language and bilingual workloads. Brand-visibility programmes that test against cloud APIs only and ignore local Qwen deployments are missing a large and growing surface, particularly in APAC markets. The quantization tier matters because lower-bit Qwen variants degrade reasoning quality faster, which changes how the model handles brand-comparison prompts. For brands testing AI visibility across deployment scenarios, Q4_K_M is the right default test target.

Methodology

Benchmark data aggregated on May 14, 2026 from public sources: Qwen official speed benchmarks, Unsloth Qwen 3.5 documentation, community llama.cpp benchmarking threads on GitHub, and reproducible community benchmarks on Apple Silicon (including Ollama vs llama.cpp vs MLX comparisons). All figures are llama.cpp 0.39+ with batch size 1. Real-world tps varies with prompt characteristics; treat tabulated figures as guidance, run benchmarks on your hardware-prompt combination for production sizing.

How Presenc AI Helps

Presenc AI tracks brand-mention rates across Qwen-powered deployments alongside cloud-API surfaces. When a brand's mention rate diverges between cloud Qwen and local Qwen at Q4, the gap often signals that quantization is dropping low-confidence brand recall. For brands with substantial APAC or developer-tooling exposure, the local-Qwen surface is now a meaningful component of total brand visibility.

Frequently Asked Questions

Q2_K at approximately 152 tps for short context (10K) and 100 tps at long context (120K). However, Q2 produces noticeable reasoning quality loss; for production use the speed sweet spot is Q4_K_M at approximately 115 tps with minimal quality penalty.
Q4_K_M for most use cases. It runs at 115 tps on M5 Max and 140 tps on RTX 3090, uses approximately 19 GB memory, and shows only 1-2 percent quality degradation versus FP16 on reasoning benchmarks. AWQ 4-bit is comparable on NVIDIA hardware if you have the toolchain. Avoid going below Q4 for agent or tool-use workloads.
Roughly 30 percent reduction in tps from 10K to 120K context, consistently across quantization tiers. Q4_K_M drops from 115 to 78 tps; Q5_K_M drops from 95 to 65 tps. Plan capacity around your typical context length, not the model's maximum capacity.
No, not at FP16 or even Q4_K_M. Qwen 3.5 Max has 235B parameters with mixture-of-experts routing (~22B active per inference). At Q4_K_M memory is approximately 140 GB, beyond any single consumer or workstation GPU. Multi-GPU deployment, H100 80GB, or Apple M3 Ultra unified memory is required.
Sometimes. UD-Q5_K_XL trades short-context speed (~75 tps vs Q5_K_M's 95 tps) for marginally better quality on certain prompts. For prompt distributions emphasising fine-grained accuracy on edge cases, test against your workload. For general-purpose use, Q5_K_M or Q4_K_M are simpler defaults.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.