
On-Device RAG Performance Benchmarks 2026

Local RAG (retrieval-augmented generation) performance in 2026: index size limits, retrieval latency, queries per second, and end-to-end RAG response time across DGX Spark, Mac Studio, and consumer GPU hardware.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

The Local RAG Stack in 2026

Retrieval-augmented generation moved on-device in 2026 as enterprise concerns about embedding-vector privacy and data residency reached a tipping point. Local RAG combines three components: an embedding model (typically 100M-1B parameters), a vector database (FAISS, Chroma, LanceDB, Qdrant), and a generation LLM (7B-70B). End-to-end performance depends on all three. This page consolidates published benchmarks across hardware tiers.
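
To make the three-component stack concrete, below is a minimal fully local pipeline sketch using sentence-transformers, FAISS, and llama-cpp-python. The embedder name and the GGUF model path are placeholders, not the exact configurations benchmarked on this page.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

# Component 1: embedding model (model name is an example; any sentence-transformers embedder works)
embedder = SentenceTransformer("BAAI/bge-m3")

# Component 2: vector index (FAISS HNSW over L2-normalized embeddings)
docs = ["Doc one text ...", "Doc two text ..."]              # your chunked corpus
vecs = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexHNSWFlat(vecs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(np.asarray(vecs, dtype="float32"))

# Component 3: local generator (path to a quantized GGUF model is a placeholder)
llm = Llama(model_path="llama-4-8b-q4.gguf", n_ctx=8192, verbose=False)

def rag_answer(question: str, k: int = 5) -> str:
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    context = "\n\n".join(docs[i] for i in ids[0] if i != -1)  # -1 = no match
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_tokens=200)
    return out["choices"][0]["text"]
```

Swapping the vector store for Chroma, LanceDB, or Qdrant changes only the middle component; the embed-retrieve-generate flow stays the same.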

Key Findings

  1. For 1M-document corpora at 768-dim embeddings, Mac Studio M5 Max sustains approximately 2,400 retrievals per second with FAISS HNSW indexing, comparable to mid-tier server hardware.
  2. End-to-end RAG response time (retrieval + generation) on DGX Spark with a 70B Q4 generator and 1M-doc corpus runs approximately 3.5-5.5 seconds for a typical 200-token answer; on Mac Studio M5 Max, 5-8 seconds.
  3. Embedding generation is the surprise bottleneck for large corpora: indexing 10M documents on a single workstation takes 8-30 hours depending on embedding model and hardware.
  4. Memory-mapped vector indexes scale to 50M+ documents on 128GB unified-memory workstations without tripping into disk-paged retrieval; this is the practical ceiling for single-machine RAG.
  5. Local RAG quality with frontier-class generation (Llama 4 70B, Qwen 3 32B) is competitive with cloud-API RAG using GPT-4o-class generators; the dominant trade-off in 2026 is latency and indexing throughput, not answer quality.

Retrieval Latency by Hardware (1M-doc corpus, top-10 retrieval, 768-dim embeddings)

| Hardware | FAISS HNSW | FAISS IVF-PQ | LanceDB |
|---|---|---|---|
| NVIDIA DGX Spark | 1.2-2.0 ms | 0.6-1.1 ms | 1.5-2.5 ms |
| Mac Studio M5 Max | 1.5-2.4 ms | 0.8-1.3 ms | 1.8-3.0 ms |
| RTX 5090 build | 1.0-1.7 ms | 0.5-0.9 ms | 1.4-2.2 ms |
| Mac mini M4 32GB | 2.2-3.5 ms | 1.2-1.9 ms | 2.5-4.0 ms |

At sub-5ms retrieval latency on all tiers, retrieval is not the bottleneck for end-to-end response time; generation dominates by orders of magnitude.
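
A minimal way to reproduce this comparison locally is to build both FAISS index types over synthetic embeddings and time batched top-10 searches. Absolute numbers will differ from the table above (real embeddings, single-query latency, and hardware all matter), but the HNSW-versus-IVF-PQ gap should be visible.

```python
import time
import numpy as np
import faiss

d, n_docs, n_queries, k = 768, 1_000_000, 1_000, 10

# Synthetic unit-norm vectors stand in for real document embeddings.
# Needs roughly 8 GB of RAM and a few minutes to build; shrink n_docs for a quick smoke test.
rng = np.random.default_rng(0)
xb = rng.random((n_docs, d), dtype=np.float32)
xq = rng.random((n_queries, d), dtype=np.float32)
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

# HNSW: memory-resident graph index, no training step (L2 over normalized vectors ~ cosine).
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.hnsw.efSearch = 64
hnsw.add(xb)

# IVF-PQ: coarse quantizer plus product quantization, needs a training pass.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 4096, 64, 8)   # 4096 lists, 64 sub-vectors, 8 bits each
ivfpq.train(xb[:200_000])
ivfpq.add(xb)
ivfpq.nprobe = 16

# Batched search amortizes per-call overhead, so single-query latency will be somewhat higher.
for name, index in [("HNSW", hnsw), ("IVF-PQ", ivfpq)]:
    t0 = time.perf_counter()
    index.search(xq, k)
    dt = time.perf_counter() - t0
    print(f"{name}: {1000 * dt / n_queries:.2f} ms per query (top-{k})")
```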

Embedding Generation Throughput (BGE-M3 568M-param embedder, batch 32)

| Hardware | Embeddings/second | Time to embed 1M docs | Time to embed 10M docs |
|---|---|---|---|
| NVIDIA DGX Spark | ~1,800/s | ~9 minutes | ~1.5 hours |
| RTX 5090 | ~2,400/s | ~7 minutes | ~1.2 hours |
| Mac Studio M5 Max | ~700/s | ~24 minutes | ~4 hours |
| Mac mini M4 32GB | ~280/s | ~60 minutes | ~10 hours |

Embedding throughput skews more strongly toward NVIDIA hardware than inference does: the small-model, batched workload favours GPU compute density.
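
A rough way to measure your own embedding throughput with sentence-transformers is sketched below. The model name and device are assumptions, and identical short documents will overstate throughput relative to real variable-length chunks.

```python
import time
from sentence_transformers import SentenceTransformer

# Assumed embedder and device; use "mps" on Apple silicon, "cpu" otherwise.
model = SentenceTransformer("BAAI/bge-m3", device="cuda")

docs = ["example document chunk of a few dozen tokens"] * 10_000   # stand-in corpus

t0 = time.perf_counter()
model.encode(docs, batch_size=32, normalize_embeddings=True, show_progress_bar=False)
dt = time.perf_counter() - t0

per_sec = len(docs) / dt
print(f"{per_sec:,.0f} embeddings/s -> 1M docs in ~{1_000_000 / per_sec / 60:.0f} min, "
      f"10M docs in ~{10_000_000 / per_sec / 3600:.1f} h")
```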

End-to-End RAG Response Time

| Hardware | Generator (Q4) | Retrieval | Generation (200 tokens) | Total |
|---|---|---|---|---|
| DGX Spark | Llama 4 70B | ~2 ms | ~5 seconds | ~5 seconds |
| DGX Spark | Llama 4 8B | ~2 ms | ~1.7 seconds | ~1.7 seconds |
| Mac M5 Max | Llama 4 70B | ~2 ms | ~7 seconds | ~7 seconds |
| Mac M5 Max | Llama 4 8B | ~2 ms | ~2.0 seconds | ~2.0 seconds |
| RTX 5090 | Llama 4 8B | ~1 ms | ~1.4 seconds | ~1.4 seconds |
| Mac mini M4 | Llama 4 8B | ~3 ms | ~3.0 seconds | ~3.0 seconds |
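
Because totals are dominated by decode speed, a back-of-envelope model is simply retrieval time plus prompt prefill plus answer tokens divided by decode rate. The decode rates below are illustrative assumptions chosen to roughly reproduce the DGX Spark rows, not measured figures.

```python
def rag_response_time(retrieval_ms: float, decode_tps: float,
                      answer_tokens: int = 200, prefill_s: float = 0.3) -> float:
    """Rough end-to-end estimate: retrieval + prompt prefill + token-by-token decode."""
    return retrieval_ms / 1000 + prefill_s + answer_tokens / decode_tps

# Assumed decode rates: ~40 tok/s for a 70B Q4 generator, ~130 tok/s for an 8B Q4 generator.
print(f"70B-class: {rag_response_time(2, 40):.1f} s")    # ~5.3 s
print(f"8B-class:  {rag_response_time(2, 130):.1f} s")   # ~1.8 s
```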

Practical Index-Size Ceilings on 128GB Workstations

FAISS HNSW with 768-dim float32 embeddings consumes roughly 3KB per document for the raw vectors (768 × 4 bytes), or roughly 4KB once graph-link overhead is included. Practical ceilings on 128GB workstations (a rough sizing calculator follows the list):

  • 50M documents at full-precision HNSW (memory-mapped): comfortable
  • 100M documents at HNSW with 8-bit embedding quantization: feasible
  • 500M documents: requires IVF-PQ or memory-mapped DiskANN; single-workstation territory ends here
  • Beyond 500M documents: distributed vector DB (Milvus, Qdrant cluster) regardless of workstation memory
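
The calculator below makes these ceilings concrete. The HNSW link overhead is an approximation we assume here (32 neighbours, 4-byte ids); exact footprints depend on FAISS build parameters and any embedding quantization applied.

```python
def hnsw_bytes_per_doc(dim: int = 768, dtype_bytes: int = 4, m: int = 32) -> int:
    """Raw vector plus approximate HNSW neighbour-link overhead (4-byte ids)."""
    return dim * dtype_bytes + 2 * m * 4        # 3,072 + 256 bytes at the defaults

def ivfpq_bytes_per_doc(m_codes: int = 64, nbits: int = 8) -> int:
    """IVF-PQ stores only a compressed code plus a document id."""
    return m_codes * nbits // 8 + 8             # 72 bytes at the defaults

GIB = 2**30
for n_docs in (50_000_000, 100_000_000, 500_000_000):
    fp32 = n_docs * hnsw_bytes_per_doc() / GIB
    int8 = n_docs * hnsw_bytes_per_doc(dtype_bytes=1) / GIB
    pq = n_docs * ivfpq_bytes_per_doc() / GIB
    print(f"{n_docs / 1e6:.0f}M docs: HNSW fp32 ~{fp32:.0f} GiB, "
          f"HNSW int8 ~{int8:.0f} GiB, IVF-PQ ~{pq:.1f} GiB")
```

At 50M documents the full-precision index already exceeds 128GB of RAM, which is why it relies on memory-mapping; at 100M, 8-bit quantization brings it back under the limit; at 500M only compressed or disk-based indexes remain workable.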

Privacy-Preserving Embedding Architectures

Three on-device patterns for sensitive RAG corpora in 2026:

  • Fully local: embedder + vector DB + generator all on one device. Best privacy; latency-bounded by single-machine throughput.
  • On-device retrieval, cloud generation: embeddings stay local, retrieved chunks sent to cloud LLM. Compromise; loses chunk privacy.
  • Federated retrieval: per-user local index; central LLM never sees user docs, only synthesised queries. Emerging pattern for personal AI.
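
As an illustration of the second pattern, the sketch below keeps embedding and retrieval on-device and ships only the retrieved chunks to a cloud generator. The index path, chunk file, endpoint URL, and model name are placeholders for whatever OpenAI-compatible service is in use.

```python
import faiss
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

# Embedding and retrieval stay on-device; only retrieved chunks leave the machine.
embedder = SentenceTransformer("BAAI/bge-m3")           # local embedder (assumed model)
index = faiss.read_index("corpus_hnsw.faiss")           # pre-built local index (placeholder path)
chunks = open("chunks.txt").read().splitlines()         # chunk text aligned with index ids

def answer(question: str, k: int = 5) -> str:
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    # Privacy trade-off: the retrieved chunks (not the whole corpus) go to the cloud LLM.
    resp = requests.post(
        "https://llm.example.com/v1/chat/completions",   # placeholder OpenAI-compatible endpoint
        json={"model": "cloud-generator",                # placeholder model name
              "messages": [{"role": "user",
                            "content": f"Context:\n{context}\n\nQuestion: {question}"}]},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```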

Brand Visibility Implications

On-device RAG is the AI surface where brand information lives in customer-controlled corpora. When an enterprise loads its sales collateral, support docs, and competitor analysis into a local RAG, the resulting AI answers about brands are shaped by what is or is not in that corpus. Brands not represented in customer-facing on-device RAG corpora are systematically under-recommended in AI flows powered by local RAG, regardless of cloud-API visibility. See the local LLM blind spot page for the operational implication.

Methodology

Retrieval latency from FAISS benchmarks and the LanceDB documentation. Embedding throughput from the BGE-M3 model card and community benchmarks. Generation throughput from our companion tokens-per-second (TPS) benchmarks page. Real workloads diverge with chunk-size and reranker choices; treat these figures as guidance. Updated quarterly.

How Presenc AI Helps

Presenc AI's local-deployment instrumentation captures brand-mention rates in on-device RAG outputs, the surface where corpus content drives recommendations and cloud-API observability is blind. For brands ensuring presence in customer-controlled RAG corpora, this is the operational feedback loop.

Frequently Asked Questions

How fast is local RAG in practice?
For interactive chat with 8B-class generators, end-to-end response in 1-3 seconds is achievable on prosumer hardware and is fully production-ready. For 70B-class generators, 5-8 seconds end-to-end on workstations is acceptable for power-user and analyst workflows but slow for consumer chat.

How large a corpus can a single workstation index?
Up to roughly 50M documents with a memory-mapped full-precision HNSW index at 768-dim embeddings on 128GB unified-memory devices. Up to 500M documents with IVF-PQ or DiskANN at the cost of slightly higher latency. Beyond that, distributed vector DBs are required.

Which embedding model should I use for local RAG?
BGE-M3 (multilingual, 568M params) is the strongest open-weight all-rounder in 2026. Nomic Embed v2, Cohere Embed-Multilingual (open), and Snowflake Arctic Embed are also production-ready. For English-only or code-focused RAG, smaller models (BGE-Small, 100M params) are often sufficient.

Is on-device RAG quality comparable to cloud-API RAG?
For retrieval, yes: the same FAISS / LanceDB / Qdrant stack runs on-device as in the cloud. For generation, on-device frontier-quality models (Llama 4 70B Q4) are within 5-10 percent of GPT-4o-class generators on RAG-quality benchmarks. The gap is shrinking; quality is rarely the limiting factor in 2026.

Does long-context in-context retrieval replace RAG?
For corpora over a few hundred thousand tokens, traditional RAG (separate retriever + generator) is faster, cheaper, and more accurate than long-context in-context retrieval. For small corpora that fit in 1M-token context windows, in-context can be competitive. The two patterns increasingly coexist in production stacks.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.