Why Local LLM Throughput Matters in 2026
Local LLM inference moved from hobbyist territory to production relevance in 2026 with three coincident events: NVIDIA shipped DGX Spark, Apple released the Mac Studio M5 Max with 128GB unified memory, and frontier-quality open-weight models (Qwen 3, Llama 4, gpt-oss) reached parity with closed APIs on common tasks. Tokens-per-second (tps) is the single number that decides whether a deployment is interactive (above ~30 tps for chat), batch-only (8-30 tps), or impractical. This page consolidates publicly published tps figures across hardware, model sizes, and quantization formats.
Key Findings
- For 7B-parameter models at Q4 quantization, Mac Studio M5 Max 128GB delivers approximately 95-110 tps in single-stream inference (MLX), versus 75-90 tps on M4 Max and 130-150 tps on RTX 5090 (llama.cpp + CUDA backend).
- For 70B-parameter models at Q4 quantization, NVIDIA DGX Spark delivers approximately 35-45 tps, Mac Studio M5 Max 25-32 tps, and RTX 5090 14-22 tps (with partial CPU offload, because 70B at Q4 exceeds its 32GB of VRAM).
- The DGX Spark advantage materialises above 30B parameters where unified-memory bandwidth (273 GB/s LPDDR5X) and 128GB capacity allow full-residency inference without CPU offload penalties.
- Quantization from FP16 to Q4 typically yields a 3.5-4.0x speedup with 1-3 percent perplexity degradation, making Q4 the practical default for production local inference (see the dedicated quantization page below).
- Throughput scales roughly inversely with parameter count once memory bandwidth becomes binding: doubling parameters approximately halves tps on a given device (a back-of-envelope model is sketched after this list).
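A minimal sketch of that bandwidth-bound rule of thumb, assuming each generated token streams the full quantized weight set from memory once. The bandwidth figure, model sizes, and efficiency factor below are illustrative assumptions, not measurements from the table that follows.

```python
def bandwidth_bound_tps(mem_bandwidth_gb_s: float, model_size_gb: float,
                        efficiency: float = 0.5) -> float:
    """Stylized decode model: each generated token reads the full quantized
    weight set from memory once, so throughput is bandwidth / model bytes,
    discounted by an assumed efficiency factor for KV-cache traffic and
    kernel overhead."""
    return efficiency * mem_bandwidth_gb_s / model_size_gb

# Hypothetical device with 400 GB/s of usable bandwidth; model sizes are
# illustrative Q4 file sizes, not figures from the comparison table below.
print(bandwidth_bound_tps(400, 4.0))   # ~7B at Q4  -> 50.0 tps estimate
print(bandwidth_bound_tps(400, 8.0))   # twice the parameters -> 25.0 tps estimate
```

Doubling the model size doubles the bytes read per token and halves the estimate, which is the scaling the finding above describes; real runs land below this ceiling for all the reasons the efficiency factor stands in for.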
Headline Comparison: Single-Stream tps by Model Size (Q4 quantization)
| Hardware | 7B | 13B | 30B | 70B | 120B (gpt-oss) |
|---|---|---|---|---|---|
| RTX 5090 (32GB) | 130-150 | 85-105 | 40-55* | 14-22* | OOM* |
| NVIDIA DGX Spark (128GB) | 105-125 | 75-95 | 50-65 | 35-45 | 20-28 |
| Mac Studio M5 Max (128GB) | 95-110 | 65-85 | 40-52 | 25-32 | 14-19 |
| Mac Studio M5 Ultra (192GB) | 120-140 | 85-105 | 55-70 | 32-42 | 20-26 |
| Mac Studio M4 Max (128GB) | 75-90 | 50-65 | 30-40 | 18-24 | 10-14 |
| Apple M4 Pro (48GB) | 50-65 | 32-42 | n/a | n/a | n/a |
*RTX 5090 figures for 30B+ assume partial CPU offload; native 32GB VRAM holds 70B only at Q3 or smaller. Figures aggregate published runs from the llama.cpp Discussions, MLX repo, and the Hugging Face inference blog. Variance is real; treat the band, not the midpoint.
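To sanity-check a single-stream figure on your own hardware, one straightforward option is to time a generation directly. A minimal sketch using the llama-cpp-python bindings; the model path, prompt, and offload setting are placeholders to adapt to your setup, and note that this rate folds prefill time into the denominator, so compare like with like.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path and settings: point at any local Q4_K_M GGUF file; set
# n_gpu_layers to 0 for CPU-only or -1 to offload all layers to GPU/Metal.
llm = Llama(model_path="models/model-q4_k_m.gguf",
            n_ctx=4096, n_gpu_layers=-1, verbose=False)

prompt = "Explain memory-bandwidth-bound inference in two sentences."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tps "
      "(includes prefill; long prompts inflate the denominator)")
```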
Prompt-Processing vs Generation tps
Two numbers matter, not one. Prompt processing (prefill) is dominated by compute and scales with FLOPS; generation (decode) is dominated by memory bandwidth. Apple Silicon shows a wide gap: the Mac M5 Max prefills a 4K-token prompt at roughly 350-450 tps but generates at 95-110 tps for a 7B model. NVIDIA hardware shows a narrower gap because its memory bandwidth is better matched to its compute. For agent-style workloads with large context windows, prefill speed is the bottleneck and CUDA hardware retains an advantage.
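A quick worked example of why prefill sets the latency budget for agent workloads: wall-clock time to a complete answer is prompt tokens divided by prefill tps plus output tokens divided by decode tps. The sketch below plugs in the M5 Max 7B figures quoted above; the 4K-token prompt and 300-token answer are an assumed scenario, not a benchmark.

```python
def wall_clock_s(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, decode_tps: float) -> float:
    """Time to a complete answer: prefill the prompt, then decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# M5 Max 7B figures from the text above: ~400 tps prefill, ~100 tps decode.
# A 4K-token agent context with a 300-token answer spends ~10s in prefill and
# ~3s decoding, so the prompt, not the reply, dominates latency.
print(f"{wall_clock_s(4096, 300, 400, 100):.1f} s")  # ~13.2 s
```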
Batched-Inference Throughput
| Hardware | 7B Q4 single-stream | 7B Q4 batched (8 streams) | Effective batch tps |
|---|---|---|---|
| RTX 5090 | ~140 tps | ~880 tps aggregate | ~110/stream |
| DGX Spark | ~115 tps | ~620 tps aggregate | ~78/stream |
| Mac Studio M5 Max | ~100 tps | ~340 tps aggregate | ~43/stream |
Apple Silicon batched throughput scales weakly because the unified-memory architecture cannot service multiple concurrent inference streams as efficiently as GPU SMs. For multi-user serving, NVIDIA hardware is materially better dollar-for-dollar.
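One way to read the batched table: divide aggregate throughput by the stream count and compare against the single-stream rate; the ratio is the per-stream efficiency retained under load. A small sketch using the midpoints quoted above.

```python
def batching_efficiency(single_tps: float, aggregate_tps: float, streams: int) -> float:
    """Fraction of the single-stream rate each user still sees when serving
    `streams` concurrent requests."""
    return (aggregate_tps / streams) / single_tps

# Midpoints from the table above, 8 concurrent streams.
for name, single, aggregate in [("RTX 5090", 140, 880),
                                ("DGX Spark", 115, 620),
                                ("Mac Studio M5 Max", 100, 340)]:
    print(f"{name}: {batching_efficiency(single, aggregate, 8):.0%} "
          "of single-stream rate per user")
```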
Reasoning-Model tps (Thinking Tokens)
Reasoning models (Qwen 3, gpt-oss-thinking, DeepSeek R1 distillations) produce multi-thousand-token internal traces before user-visible output. Effective wall-clock time to a useful answer is dominated by thinking-token generation, not user-facing tps. For a 32B reasoning model that emits 2,000 thinking tokens before a 200-token answer:
- DGX Spark: 50 tps generation = ~44 seconds total, ~4 seconds visible
- Mac M5 Max: 35 tps generation = ~63 seconds total, ~6 seconds visible
- RTX 5090 (offload): 25 tps generation = ~88 seconds total, ~8 seconds visible
For reasoning workloads, the device that holds the full model in its fastest memory wins; high-bandwidth discrete GPU memory only regains the lead at model sizes small enough to fit entirely in VRAM.
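The wall-clock figures above reduce to a single division: (thinking tokens + answer tokens) over generation tps, with only the answer portion visible to the user. A minimal sketch reproducing them:

```python
def reasoning_latency(thinking_tokens: int, answer_tokens: int, gen_tps: float):
    """Return (total seconds to finish, seconds spent on the visible answer)."""
    total = (thinking_tokens + answer_tokens) / gen_tps
    visible = answer_tokens / gen_tps
    return total, visible

# The 32B scenario above: 2,000 thinking tokens before a 200-token answer.
for name, tps in [("DGX Spark", 50), ("Mac M5 Max", 35), ("RTX 5090 (offload)", 25)]:
    total, visible = reasoning_latency(2000, 200, tps)
    print(f"{name}: {total:.0f}s total, {visible:.0f}s of visible output")
```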
Brand Visibility Implications
Three implications for brand teams thinking about local LLMs. First, throughput above 30 tps for the most-deployed sizes (7B-30B at Q4) makes local LLMs viable for production agents and RAG, which means a meaningful share of brand queries in 2026 hits a model you cannot influence through cloud-API recommendation paths. Second, DGX Spark at 35-45 tps for 70B models brings frontier-class open-weight models into single-developer reach, which accelerates the open-source LLM share of brand-visibility surface area. Third, batch-tps differences between Apple and NVIDIA hardware decide which on-prem deployments are economically viable: NVIDIA still wins for multi-user enterprise inference, while Apple wins on power efficiency for developer workstations. See the local LLM brand visibility blind spot page for what this means operationally.
Methodology
Numbers compiled from public benchmarks: llama.cpp Discussions (community runs on consumer hardware), the MLX Examples repo (Apple Silicon MLX runs), the NVIDIA DGX Spark product page (vendor-published throughput claims), the Apple Mac Studio specs page (memory bandwidth and capacity), and Artificial Analysis (cross-checked cloud-API tps for relative reference). Bands reflect the range at batch size 1 across prompt lengths 512-4096 and quantization formats GGUF Q4_K_M / MLX Q4. Figures are not warranties; treat the order of magnitude, not the exact number, as load-bearing. Last update: May 2026.
How Presenc AI Helps
Presenc AI tracks brand visibility on local and open-weight LLM deployments through partnered enterprise instrumentation. Where models run on-device or on-prem, traditional cloud-API monitoring is blind; our deployment-side telemetry surfaces brand mention rates, citation behaviour, and recommendation drift in environments cloud-only competitors cannot see. For brands whose audiences operate air-gapped or local-first AI workflows, this is the only operational visibility into a fast-growing surface.