
On-Device RAG Performance Benchmarks 2026

Local RAG (retrieval-augmented generation) performance in 2026: index size limits, retrieval latency, queries per second, and end-to-end RAG response time across DGX Spark, Mac Studio, and consumer GPU hardware.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

The Local RAG Stack in 2026

Retrieval-augmented generation moved on-device in 2026 as enterprise concerns about embedding-vector privacy and data residency reached a tipping point. Local RAG combines three components: an embedding model (typically 100M-1B parameters), a vector database (FAISS, Chroma, LanceDB, Qdrant), and a generation LLM (7B-70B). End-to-end performance depends on all three. This page consolidates published benchmarks across hardware tiers.
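
To make the three-component stack concrete, below is a minimal fully local pipeline sketch using sentence-transformers, FAISS, and llama-cpp-python. The embedder name and the GGUF model path are placeholders, not the exact configurations benchmarked on this page.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

# Component 1: embedding model (model name is an example; any sentence-transformers embedder works)
embedder = SentenceTransformer("BAAI/bge-m3")

# Component 2: vector index (FAISS HNSW over L2-normalized embeddings)
docs = ["Doc one text ...", "Doc two text ..."]              # your chunked corpus
vecs = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexHNSWFlat(vecs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(np.asarray(vecs, dtype="float32"))

# Component 3: local generator (path to a quantized GGUF model is a placeholder)
llm = Llama(model_path="llama-4-8b-q4.gguf", n_ctx=8192, verbose=False)

def rag_answer(question: str, k: int = 5) -> str:
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    context = "\n\n".join(docs[i] for i in ids[0] if i != -1)  # -1 = no match
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_tokens=200)
    return out["choices"][0]["text"]
```

Swapping the vector store for Chroma, LanceDB, or Qdrant changes only the middle component; the embed-retrieve-generate flow stays the same.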

Key Findings

  1. For 1M-document corpora at 768-dim embeddings, Mac Studio M5 Max sustains approximately 2,400 retrievals per second with FAISS HNSW indexing, comparable to mid-tier server hardware.
  2. End-to-end RAG response time (retrieval + generation) on DGX Spark with a 70B Q4 generator and 1M-doc corpus runs approximately 3.5-5.5 seconds for a typical 200-token answer; on Mac Studio M5 Max, 5-8 seconds.
  3. Embedding generation is the surprise bottleneck for large corpora: indexing 10M documents on a single workstation takes 8-30 hours depending on embedding model and hardware.
  4. Memory-mapped vector indexes scale to 50M+ documents on 128GB unified-memory workstations without tripping into disk-paged retrieval; this is the practical ceiling for single-machine RAG.
  5. Local RAG quality with frontier-class generation (Llama 4 70B, Qwen 3 32B) is competitive with cloud-API RAG using GPT-4o-class generators; the dominant trade-off in 2026 is latency and indexing throughput, not answer quality.

Retrieval Latency by Hardware (1M-doc corpus, top-10 retrieval, 768-dim embeddings)

| Hardware | FAISS HNSW | FAISS IVF-PQ | LanceDB |
|---|---|---|---|
| NVIDIA DGX Spark | 1.2-2.0 ms | 0.6-1.1 ms | 1.5-2.5 ms |
| Mac Studio M5 Max | 1.5-2.4 ms | 0.8-1.3 ms | 1.8-3.0 ms |
| RTX 5090 build | 1.0-1.7 ms | 0.5-0.9 ms | 1.4-2.2 ms |
| Mac mini M4 32GB | 2.2-3.5 ms | 1.2-1.9 ms | 2.5-4.0 ms |

At sub-5ms retrieval latency on all tiers, retrieval is not the bottleneck for end-to-end response time; generation dominates by orders of magnitude.
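
A minimal way to reproduce this comparison locally is to build both FAISS index types over synthetic embeddings and time batched top-10 searches. Absolute numbers will differ from the table above (real embeddings, single-query latency, and hardware all matter), but the HNSW-versus-IVF-PQ gap should be visible.

```python
import time
import numpy as np
import faiss

d, n_docs, n_queries, k = 768, 1_000_000, 1_000, 10

# Synthetic unit-norm vectors stand in for real document embeddings.
# Needs roughly 8 GB of RAM and a few minutes to build; shrink n_docs for a quick smoke test.
rng = np.random.default_rng(0)
xb = rng.random((n_docs, d), dtype=np.float32)
xq = rng.random((n_queries, d), dtype=np.float32)
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

# HNSW: memory-resident graph index, no training step (L2 over normalized vectors ~ cosine).
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.hnsw.efSearch = 64
hnsw.add(xb)

# IVF-PQ: coarse quantizer plus product quantization, needs a training pass.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 4096, 64, 8)   # 4096 lists, 64 sub-vectors, 8 bits each
ivfpq.train(xb[:200_000])
ivfpq.add(xb)
ivfpq.nprobe = 16

# Batched search amortizes per-call overhead, so single-query latency will be somewhat higher.
for name, index in [("HNSW", hnsw), ("IVF-PQ", ivfpq)]:
    t0 = time.perf_counter()
    index.search(xq, k)
    dt = time.perf_counter() - t0
    print(f"{name}: {1000 * dt / n_queries:.2f} ms per query (top-{k})")
```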

Embedding Generation Throughput (BGE-M3 568M-param embedder, batch 32)

| Hardware | Embeddings/second | Time to embed 1M docs | Time to embed 10M docs |
|---|---|---|---|
| NVIDIA DGX Spark | ~1,800/s | ~9 minutes | ~1.5 hours |
| RTX 5090 | ~2,400/s | ~7 minutes | ~1.2 hours |
| Mac Studio M5 Max | ~700/s | ~24 minutes | ~4 hours |
| Mac mini M4 32GB | ~280/s | ~60 minutes | ~10 hours |

Embedding throughput skews more strongly toward NVIDIA hardware than inference does: the small-model, batched workload favours GPU compute density.
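
A rough way to measure your own embedding throughput with sentence-transformers is sketched below. The model name and device are assumptions, and identical short documents will overstate throughput relative to real variable-length chunks.

```python
import time
from sentence_transformers import SentenceTransformer

# Assumed embedder and device; use "mps" on Apple silicon, "cpu" otherwise.
model = SentenceTransformer("BAAI/bge-m3", device="cuda")

docs = ["example document chunk of a few dozen tokens"] * 10_000   # stand-in corpus

t0 = time.perf_counter()
model.encode(docs, batch_size=32, normalize_embeddings=True, show_progress_bar=False)
dt = time.perf_counter() - t0

per_sec = len(docs) / dt
print(f"{per_sec:,.0f} embeddings/s -> 1M docs in ~{1_000_000 / per_sec / 60:.0f} min, "
      f"10M docs in ~{10_000_000 / per_sec / 3600:.1f} h")
```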

End-to-End RAG Response Time

| Hardware | Generator (Q4) | Retrieval | Generation (200 tokens) | Total |
|---|---|---|---|---|
| DGX Spark | Llama 4 70B | ~2 ms | ~5 seconds | ~5 seconds |
| DGX Spark | Llama 4 8B | ~2 ms | ~1.7 seconds | ~1.7 seconds |
| Mac M5 Max | Llama 4 70B | ~2 ms | ~7 seconds | ~7 seconds |
| Mac M5 Max | Llama 4 8B | ~2 ms | ~2.0 seconds | ~2.0 seconds |
| RTX 5090 | Llama 4 8B | ~1 ms | ~1.4 seconds | ~1.4 seconds |
| Mac mini M4 | Llama 4 8B | ~3 ms | ~3.0 seconds | ~3.0 seconds |
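
Because totals are dominated by decode speed, a back-of-envelope model is simply retrieval time plus prompt prefill plus answer tokens divided by decode rate. The decode rates below are illustrative assumptions chosen to roughly reproduce the DGX Spark rows, not measured figures.

```python
def rag_response_time(retrieval_ms: float, decode_tps: float,
                      answer_tokens: int = 200, prefill_s: float = 0.3) -> float:
    """Rough end-to-end estimate: retrieval + prompt prefill + token-by-token decode."""
    return retrieval_ms / 1000 + prefill_s + answer_tokens / decode_tps

# Assumed decode rates: ~40 tok/s for a 70B Q4 generator, ~130 tok/s for an 8B Q4 generator.
print(f"70B-class: {rag_response_time(2, 40):.1f} s")    # ~5.3 s
print(f"8B-class:  {rag_response_time(2, 130):.1f} s")   # ~1.8 s
```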

Practical Index-Size Ceilings on 128GB Workstations

FAISS HNSW with 768-dim float32 embeddings consumes roughly 3KB per document for the raw vectors (768 × 4 bytes), or roughly 4KB once graph-link overhead is included. Practical ceilings on 128GB workstations (a rough sizing calculator follows the list):

  • 50M documents at full-precision HNSW (memory-mapped): comfortable
  • 100M documents at HNSW with 8-bit embedding quantization: feasible
  • 500M documents: requires IVF-PQ or memory-mapped DiskANN; single-workstation territory ends here
  • Beyond 500M documents: distributed vector DB (Milvus, Qdrant cluster) regardless of workstation memory
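
The calculator below makes these ceilings concrete. The HNSW link overhead is an approximation we assume here (32 neighbours, 4-byte ids); exact footprints depend on FAISS build parameters and any embedding quantization applied.

```python
def hnsw_bytes_per_doc(dim: int = 768, dtype_bytes: int = 4, m: int = 32) -> int:
    """Raw vector plus approximate HNSW neighbour-link overhead (4-byte ids)."""
    return dim * dtype_bytes + 2 * m * 4        # 3,072 + 256 bytes at the defaults

def ivfpq_bytes_per_doc(m_codes: int = 64, nbits: int = 8) -> int:
    """IVF-PQ stores only a compressed code plus a document id."""
    return m_codes * nbits // 8 + 8             # 72 bytes at the defaults

GIB = 2**30
for n_docs in (50_000_000, 100_000_000, 500_000_000):
    fp32 = n_docs * hnsw_bytes_per_doc() / GIB
    int8 = n_docs * hnsw_bytes_per_doc(dtype_bytes=1) / GIB
    pq = n_docs * ivfpq_bytes_per_doc() / GIB
    print(f"{n_docs / 1e6:.0f}M docs: HNSW fp32 ~{fp32:.0f} GiB, "
          f"HNSW int8 ~{int8:.0f} GiB, IVF-PQ ~{pq:.1f} GiB")
```

At 50M documents the full-precision index already exceeds 128GB of RAM, which is why it relies on memory-mapping; at 100M, 8-bit quantization brings it back under the limit; at 500M only compressed or disk-based indexes remain workable.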

Privacy-Preserving Embedding Architectures

Three on-device patterns for sensitive RAG corpora in 2026:

  • Fully local: embedder + vector DB + generator all on one device. Best privacy; latency-bounded by single-machine throughput.
  • On-device retrieval, cloud generation: embeddings stay local, retrieved chunks sent to cloud LLM. Compromise; loses chunk privacy.
  • Federated retrieval: per-user local index; central LLM never sees user docs, only synthesised queries. Emerging pattern for personal AI.
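
As an illustration of the second pattern, the sketch below keeps embedding and retrieval on-device and ships only the retrieved chunks to a cloud generator. The index path, chunk file, endpoint URL, and model name are placeholders for whatever OpenAI-compatible service is in use.

```python
import faiss
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

# Embedding and retrieval stay on-device; only retrieved chunks leave the machine.
embedder = SentenceTransformer("BAAI/bge-m3")           # local embedder (assumed model)
index = faiss.read_index("corpus_hnsw.faiss")           # pre-built local index (placeholder path)
chunks = open("chunks.txt").read().splitlines()         # chunk text aligned with index ids

def answer(question: str, k: int = 5) -> str:
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    # Privacy trade-off: the retrieved chunks (not the whole corpus) go to the cloud LLM.
    resp = requests.post(
        "https://llm.example.com/v1/chat/completions",   # placeholder OpenAI-compatible endpoint
        json={"model": "cloud-generator",                # placeholder model name
              "messages": [{"role": "user",
                            "content": f"Context:\n{context}\n\nQuestion: {question}"}]},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```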

Brand Visibility Implications

On-device RAG is the AI surface where brand information lives in customer-controlled corpora. When an enterprise loads its sales collateral, support docs, and competitor analysis into a local RAG, the resulting AI answers about brands are shaped by what is or is not in that corpus. Brands not represented in customer-facing on-device RAG corpora are systematically under-recommended in AI flows powered by local RAG, regardless of cloud-API visibility. See the local LLM blind spot page for the operational implication.

Methodology

Retrieval latency from FAISS benchmarks and the LanceDB documentation. Embedding throughput from the BGE-M3 model card and community benchmarks. Generation throughput from our companion tokens-per-second (TPS) benchmarks page. Real workloads diverge with chunk-size and reranker choices; treat these figures as guidance. Updated quarterly.

How Presenc AI Helps

Presenc AI's local-deployment instrumentation captures brand-mention rates in on-device RAG outputs, the surface where corpus content drives recommendations and cloud-API observability is blind. For brands ensuring presence in customer-controlled RAG corpora, this is the operational feedback loop.

Frequently Asked Questions

How fast is local RAG in practice?
For interactive chat with 8B-class generators, end-to-end response in 1-3 seconds is achievable on prosumer hardware and is fully production-ready. For 70B-class generators, 5-8 seconds end-to-end on workstations is acceptable for power-user and analyst workflows but slow for consumer chat.

How large a corpus can a single workstation index?
Up to roughly 50M documents with a memory-mapped full-precision HNSW index at 768-dim embeddings on 128GB unified-memory devices. Up to 500M documents with IVF-PQ or DiskANN at the cost of slightly higher latency. Beyond that, distributed vector DBs are required.

Which embedding model should I use for local RAG?
BGE-M3 (multilingual, 568M params) is the strongest open-weight all-rounder in 2026. Nomic Embed v2, Cohere Embed-Multilingual (open), and Snowflake Arctic Embed are also production-ready. For English-only or code-focused RAG, smaller models (BGE-Small, 100M params) are often sufficient.

Is on-device RAG quality comparable to cloud-API RAG?
For retrieval, yes: the same FAISS / LanceDB / Qdrant stack runs on-device as in the cloud. For generation, on-device frontier-quality models (Llama 4 70B Q4) are within 5-10 percent of GPT-4o-class generators on RAG-quality benchmarks. The gap is shrinking; quality is rarely the limiting factor in 2026.

Does long-context in-context retrieval replace RAG?
For corpora over a few hundred thousand tokens, traditional RAG (separate retriever + generator) is faster, cheaper, and more accurate than long-context in-context retrieval. For small corpora that fit in 1M-token context windows, in-context can be competitive. The two patterns increasingly coexist in production stacks.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.