Why Local LLM Throughput Matters in 2026
Local LLM inference moved from hobbyist territory to production relevance in 2026 with three coincident events: NVIDIA shipped DGX Spark, Apple released the Mac Studio M5 Max with 128GB unified memory, and frontier-quality open-weight models (Qwen 3, Llama 4, gpt-oss) reached parity with closed APIs on common tasks. Tokens-per-second (tps) is the single number that decides whether a deployment is interactive (above ~30 tps for chat), batch-only (8-30 tps), or impractical. This page consolidates publicly published tps figures across hardware, model sizes, and quantization formats.
Key Findings
- For 7B-parameter models at Q4 quantization, Mac Studio M5 Max 128GB delivers approximately 95-110 tps in single-stream inference (MLX), versus 75-90 tps on M4 Max and 130-150 tps on RTX 5090 (llama.cpp + CUDA backend).
- For 70B-parameter models at Q4 quantization, NVIDIA DGX Spark delivers approximately 35-45 tps, Mac Studio M5 Max 25-32 tps, and RTX 5090 14-22 tps (with partial CPU offload, because 70B at Q4 exceeds its 32GB of VRAM).
- The DGX Spark advantage materialises above 30B parameters where unified-memory bandwidth (273 GB/s LPDDR5X) and 128GB capacity allow full-residency inference without CPU offload penalties.
- Quantization from FP16 to Q4 typically yields a 3.5-4.0x speedup with 1-3 percent perplexity degradation, making Q4 the practical default for production local inference (see the dedicated quantization page below).
- Throughput scales roughly inversely with parameter count once memory bandwidth becomes binding: doubling parameters approximately halves tps on a given device (a back-of-envelope model is sketched after this list).
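A minimal sketch of that bandwidth-bound rule of thumb, assuming each generated token streams the full quantized weight set from memory once. The bandwidth figure, model sizes, and efficiency factor below are illustrative assumptions, not measurements from the table that follows.

```python
def bandwidth_bound_tps(mem_bandwidth_gb_s: float, model_size_gb: float,
                        efficiency: float = 0.5) -> float:
    """Stylized decode model: each generated token reads the full quantized
    weight set from memory once, so throughput is bandwidth / model bytes,
    discounted by an assumed efficiency factor for KV-cache traffic and
    kernel overhead."""
    return efficiency * mem_bandwidth_gb_s / model_size_gb

# Hypothetical device with 400 GB/s of usable bandwidth; model sizes are
# illustrative Q4 file sizes, not figures from the comparison table below.
print(bandwidth_bound_tps(400, 4.0))   # ~7B at Q4  -> 50.0 tps estimate
print(bandwidth_bound_tps(400, 8.0))   # twice the parameters -> 25.0 tps estimate
```

Doubling the model size doubles the bytes read per token and halves the estimate, which is the scaling the finding above describes; real runs land below this ceiling for all the reasons the efficiency factor stands in for.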
Headline Comparison: Single-Stream tps by Model Size (Q4 quantization)
| Hardware | 7B | 13B | 30B | 70B | 120B (gpt-oss) |
|---|---|---|---|---|---|
| RTX 5090 (32GB) | 130-150 | 85-105 | 40-55* | 14-22* | OOM* |
| NVIDIA DGX Spark (128GB) | 105-125 | 75-95 | 50-65 | 35-45 | 20-28 |
| Mac Studio M5 Max (128GB) | 95-110 | 65-85 | 40-52 | 25-32 | 14-19 |
| Mac Studio M5 Ultra (192GB) | 120-140 | 85-105 | 55-70 | 32-42 | 20-26 |
| Mac Studio M4 Max (128GB) | 75-90 | 50-65 | 30-40 | 18-24 | 10-14 |
| Apple M4 Pro (48GB) | 50-65 | 32-42 | n/a | n/a | n/a |
*RTX 5090 figures for 30B+ assume partial CPU offload; native 32GB VRAM holds 70B only at Q3 or smaller. Figures aggregate published runs from the llama.cpp Discussions, MLX repo, and the Hugging Face inference blog. Variance is real; treat the band, not the midpoint.
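To sanity-check a single-stream figure on your own hardware, one straightforward option is to time a generation directly. A minimal sketch using the llama-cpp-python bindings; the model path, prompt, and offload setting are placeholders to adapt to your setup, and note that this rate folds prefill time into the denominator, so compare like with like.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path and settings: point at any local Q4_K_M GGUF file; set
# n_gpu_layers to 0 for CPU-only or -1 to offload all layers to GPU/Metal.
llm = Llama(model_path="models/model-q4_k_m.gguf",
            n_ctx=4096, n_gpu_layers=-1, verbose=False)

prompt = "Explain memory-bandwidth-bound inference in two sentences."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tps "
      "(includes prefill; long prompts inflate the denominator)")
```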
Prompt-Processing vs Generation tps
Two numbers matter, not one. Prompt processing (prefill) is dominated by compute and scales with FLOPS; generation (decode) is dominated by memory bandwidth. Apple Silicon shows a wide gap: the Mac M5 Max prefills a 4K-token prompt at roughly 350-450 tps but generates at 95-110 tps for a 7B model. NVIDIA hardware shows a narrower gap because its memory bandwidth is better matched to its compute. For agent-style workloads with large context windows, prefill speed is the bottleneck and CUDA hardware retains an advantage.
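A quick worked example of why prefill sets the latency budget for agent workloads: wall-clock time to a complete answer is prompt tokens divided by prefill tps plus output tokens divided by decode tps. The sketch below plugs in the M5 Max 7B figures quoted above; the 4K-token prompt and 300-token answer are an assumed scenario, not a benchmark.

```python
def wall_clock_s(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, decode_tps: float) -> float:
    """Time to a complete answer: prefill the prompt, then decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# M5 Max 7B figures from the text above: ~400 tps prefill, ~100 tps decode.
# A 4K-token agent context with a 300-token answer spends ~10s in prefill and
# ~3s decoding, so the prompt, not the reply, dominates latency.
print(f"{wall_clock_s(4096, 300, 400, 100):.1f} s")  # ~13.2 s
```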
Batched-Inference Throughput
| Hardware | 7B Q4 single-stream | 7B Q4 batched (8 streams) | Effective batch tps |
|---|---|---|---|
| RTX 5090 | ~140 tps | ~880 tps aggregate | ~110/stream |
| DGX Spark | ~115 tps | ~620 tps aggregate | ~78/stream |
| Mac Studio M5 Max | ~100 tps | ~340 tps aggregate | ~43/stream |
Apple Silicon batched throughput scales weakly because the unified-memory architecture cannot service multiple concurrent inference streams as efficiently as GPU SMs. For multi-user serving, NVIDIA hardware is materially better dollar-for-dollar.
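One way to read the batched table: divide aggregate throughput by the stream count and compare against the single-stream rate; the ratio is the per-stream efficiency retained under load. A small sketch using the midpoints quoted above.

```python
def batching_efficiency(single_tps: float, aggregate_tps: float, streams: int) -> float:
    """Fraction of the single-stream rate each user still sees when serving
    `streams` concurrent requests."""
    return (aggregate_tps / streams) / single_tps

# Midpoints from the table above, 8 concurrent streams.
for name, single, aggregate in [("RTX 5090", 140, 880),
                                ("DGX Spark", 115, 620),
                                ("Mac Studio M5 Max", 100, 340)]:
    print(f"{name}: {batching_efficiency(single, aggregate, 8):.0%} "
          "of single-stream rate per user")
```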
Reasoning-Model tps (Thinking Tokens)
Reasoning models (Qwen 3, gpt-oss-thinking, DeepSeek R1 distillations) produce multi-thousand-token internal traces before user-visible output. Effective wall-clock time to a useful answer is dominated by thinking-token generation, not user-facing tps. For a 32B reasoning model that emits 2,000 thinking tokens before a 200-token answer:
- DGX Spark: 50 tps generation = ~44 seconds total, ~4 seconds visible
- Mac M5 Max: 35 tps generation = ~63 seconds total, ~6 seconds visible
- RTX 5090 (offload): 25 tps generation = ~88 seconds total, ~8 seconds visible
For reasoning workloads, the device that holds the full model in its fastest memory wins; high-bandwidth discrete GPU memory only regains the lead at model sizes small enough to fit entirely in VRAM.
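The wall-clock figures above reduce to a single division: (thinking tokens + answer tokens) over generation tps, with only the answer portion visible to the user. A minimal sketch reproducing them:

```python
def reasoning_latency(thinking_tokens: int, answer_tokens: int, gen_tps: float):
    """Return (total seconds to finish, seconds spent on the visible answer)."""
    total = (thinking_tokens + answer_tokens) / gen_tps
    visible = answer_tokens / gen_tps
    return total, visible

# The 32B scenario above: 2,000 thinking tokens before a 200-token answer.
for name, tps in [("DGX Spark", 50), ("Mac M5 Max", 35), ("RTX 5090 (offload)", 25)]:
    total, visible = reasoning_latency(2000, 200, tps)
    print(f"{name}: {total:.0f}s total, {visible:.0f}s of visible output")
```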
Brand Visibility Implications
Three implications for brand teams thinking about local LLMs. First, throughput above 30 tps for the most-deployed sizes (7B-30B at Q4) makes local LLMs viable for production agents and RAG, which means a meaningful share of brand queries in 2026 hits a model you cannot influence through cloud-API recommendation paths. Second, DGX Spark at 35-45 tps for 70B models brings frontier-class open-weight models into single-developer reach, which accelerates the open-source LLM share of brand-visibility surface area. Third, batch-tps differences between Apple and NVIDIA hardware decide which on-prem deployments are economically viable: NVIDIA still wins for multi-user enterprise inference, while Apple wins on power efficiency for developer workstations. See the local LLM brand visibility blind spot page for what this means operationally.
Methodology
Numbers compiled from public benchmarks: llama.cpp Discussions (community runs on consumer hardware), the MLX Examples repo (Apple Silicon MLX runs), the NVIDIA DGX Spark product page (vendor-published throughput claims), the Apple Mac Studio specs page (memory bandwidth and capacity), and Artificial Analysis (cross-checked cloud-API tps for relative reference). Bands reflect the range at batch size 1 across prompt lengths 512-4096 and quantization formats GGUF Q4_K_M / MLX Q4. Figures are not warranties; treat the order of magnitude, not the exact number, as load-bearing. Last update: May 2026.
How Presenc AI Helps
Presenc AI tracks brand visibility on local and open-weight LLM deployments through partnered enterprise instrumentation. Where models run on-device or on-prem, traditional cloud-API monitoring is blind; our deployment-side telemetry surfaces brand mention rates, citation behaviour, and recommendation drift in environments cloud-only competitors cannot see. For brands whose audiences operate air-gapped or local-first AI workflows, this is the only operational visibility into a fast-growing surface.