The Local RAG Stack in 2026
Retrieval-augmented generation moved on-device in 2026 as enterprise concerns about embedding-vector privacy and data residency reached a tipping point. Local RAG combines three components: an embedding model (typically 100M-1B parameters), a vector database (FAISS, Chroma, LanceDB, Qdrant), and a generation LLM (7B-70B). End-to-end performance depends on all three. This page consolidates published benchmarks across hardware tiers.
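To make the three-component breakdown concrete, below is a minimal sketch of a fully local pipeline using sentence-transformers, FAISS, and llama-cpp-python. The model names, GGUF path, and toy corpus are illustrative placeholders, not the configurations benchmarked on this page.

```python
# Minimal fully local RAG pipeline: embedder + FAISS HNSW index + local generator.
# Model names, the GGUF path, and the toy corpus are placeholders.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

docs = [
    "FAISS supports HNSW and IVF-PQ index types.",
    "LanceDB stores vectors in a columnar on-disk format.",
    "Qdrant can run as an embedded or clustered vector database.",
]

# 1. Embed the corpus locally (dimension is taken from the model's output).
embedder = SentenceTransformer("BAAI/bge-m3")
doc_vecs = embedder.encode(docs, normalize_embeddings=True).astype(np.float32)

# 2. Build an HNSW index; inner product on normalized vectors == cosine similarity.
index = faiss.IndexHNSWFlat(doc_vecs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(doc_vecs)

# 3. Retrieve the top-k chunks for a query.
query = "Which index types does FAISS offer?"
q_vec = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
_, ids = index.search(q_vec, 2)
context = "\n".join(docs[i] for i in ids[0])

# 4. Generate an answer with a local quantized model (path is a placeholder).
llm = Llama(model_path="models/llama-8b-q4.gguf", n_ctx=4096, verbose=False)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(llm(prompt, max_tokens=200)["choices"][0]["text"])
```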
Key Findings
- For 1M-document corpora at 768-dim embeddings, Mac Studio M5 Max sustains approximately 2,400 retrievals per second with FAISS HNSW indexing, comparable to mid-tier server hardware.
- End-to-end RAG response time (retrieval + generation) on DGX Spark with a 70B Q4 generator and 1M-doc corpus runs approximately 3.5-5.5 seconds for a typical 200-token answer; on Mac Studio M5 Max, 5-8 seconds.
- Embedding generation is the surprise bottleneck for large corpora: indexing 10M documents on a single workstation takes 8-30 hours depending on embedding model, chunking strategy, and hardware.
- Memory-mapped vector indexes scale to 50M+ documents on 128GB unified-memory workstations without tripping into disk-paged retrieval; this is the practical ceiling for single-machine RAG.
- Local RAG quality with frontier-class generation (Llama 4 70B, Qwen 3 32B) is competitive with cloud-API RAG using GPT-4o-class generators; the dominant trade-off in 2026 is latency and indexing throughput, not answer quality.
Retrieval Latency by Hardware (1M-doc corpus, top-10 retrieval, 768-dim embeddings)
| Hardware | FAISS HNSW | FAISS IVF-PQ | LanceDB |
|---|---|---|---|
| NVIDIA DGX Spark | 1.2-2.0 ms | 0.6-1.1 ms | 1.5-2.5 ms |
| Mac Studio M5 Max | 1.5-2.4 ms | 0.8-1.3 ms | 1.8-3.0 ms |
| RTX 5090 build | 1.0-1.7 ms | 0.5-0.9 ms | 1.4-2.2 ms |
| Mac mini M4 32GB | 2.2-3.5 ms | 1.2-1.9 ms | 2.5-4.0 ms |
With retrieval under 5 ms on every tier, retrieval is not the bottleneck for end-to-end response time; generation dominates by orders of magnitude.
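The retrieval figures above can be sanity-checked with a few lines of FAISS. The sketch below builds an HNSW index over random 768-dim vectors and times single-query top-10 searches; the corpus size and HNSW parameters (M, efConstruction, efSearch) are assumptions, not the exact benchmark configuration.

```python
# Build a FAISS HNSW index over random 768-dim vectors and time single-query
# top-10 searches. Corpus size and HNSW parameters are illustrative only.
import time
import faiss
import numpy as np

dim, n_docs, n_queries, top_k = 768, 100_000, 1_000, 10

rng = np.random.default_rng(0)
xb = rng.standard_normal((n_docs, dim), dtype=np.float32)
xq = rng.standard_normal((n_queries, dim), dtype=np.float32)

index = faiss.IndexHNSWFlat(dim, 32)   # M=32 graph links per node
index.hnsw.efConstruction = 200        # build-time accuracy/speed trade-off
index.add(xb)

index.hnsw.efSearch = 64               # query-time accuracy/speed trade-off
start = time.perf_counter()
for q in xq:
    index.search(q.reshape(1, -1), top_k)   # one query at a time, as in the table
elapsed_ms = (time.perf_counter() - start) * 1000 / n_queries
print(f"mean single-query retrieval latency: {elapsed_ms:.2f} ms")
```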
Embedding Generation Throughput (BGE-M3 568M-param embedder, batch 32)
| Hardware | Embeddings/second | Time to embed 1M docs | Time to embed 10M docs |
|---|---|---|---|
| NVIDIA DGX Spark | ~1,800/s | ~9 minutes | ~1.5 hours |
| RTX 5090 | ~2,400/s | ~7 minutes | ~1.2 hours |
| Mac Studio M5 Max | ~700/s | ~24 minutes | ~4 hours |
| Mac mini M4 32GB | ~280/s | ~60 minutes | ~10 hours |
Embedding throughput skews more strongly toward NVIDIA hardware than LLM inference does, because the small-model, batched workload favours raw GPU compute density.
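A rough way to probe embedding throughput on your own hardware, in the spirit of the table above, is to encode a batch of synthetic chunks with sentence-transformers at batch size 32. The chunk text and corpus size are placeholders, and results vary with backend (CUDA vs. MPS) and sequence length.

```python
# Embedding-throughput probe at batch size 32 with sentence-transformers.
# Chunk text and corpus size are synthetic placeholders.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")   # the 568M-param embedder from the table
docs = [f"Representative document chunk number {i}, a few hundred tokens long."
        for i in range(2_048)]

start = time.perf_counter()
model.encode(docs, batch_size=32, normalize_embeddings=True, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:,.0f} embeddings/second")
```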
End-to-End RAG Response Time
| Hardware | Generator (Q4) | Retrieval | Generation (200 tokens) | Total |
|---|---|---|---|---|
| DGX Spark | Llama 4 70B | ~2 ms | ~5 seconds | ~5 seconds |
| DGX Spark | Llama 4 8B | ~2 ms | ~1.7 seconds | ~1.7 seconds |
| Mac M5 Max | Llama 4 70B | ~2 ms | ~7 seconds | ~7 seconds |
| Mac M5 Max | Llama 4 8B | ~2 ms | ~2.0 seconds | ~2.0 seconds |
| RTX 5090 | Llama 4 8B | ~1 ms | ~1.4 seconds | ~1.4 seconds |
| Mac mini M4 | Llama 4 8B | ~3 ms | ~3.0 seconds | ~3.0 seconds |
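To reproduce the retrieval/generation split in the table, a small timing wrapper is enough. The sketch below assumes `index`, `embedder`, `llm`, and `docs` objects like those in the first sketch; `answer_with_timing` is a hypothetical helper name, not part of any published harness.

```python
# Timing wrapper that splits an end-to-end RAG call into the table's columns.
# Assumes `index`, `embedder`, `llm`, and `docs` objects like those in the
# first sketch; `answer_with_timing` is a hypothetical helper name.
import time

def answer_with_timing(query, index, embedder, llm, docs, top_k=10, max_tokens=200):
    t0 = time.perf_counter()
    q_vec = embedder.encode([query], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q_vec, top_k)
    t1 = time.perf_counter()                       # retrieval finished

    context = "\n".join(docs[i] for i in ids[0])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    out = llm(prompt, max_tokens=max_tokens)
    t2 = time.perf_counter()                       # generation finished

    print(f"retrieval: {(t1 - t0) * 1000:.1f} ms | generation: {t2 - t1:.1f} s")
    return out["choices"][0]["text"]
```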
Practical Index-Size Ceilings on 128GB Workstations
A 768-dim float32 embedding is 3,072 bytes; FAISS HNSW graph links add a few hundred bytes more, for a practical footprint of roughly 3.5KB per document. Practical ceilings on 128GB workstations (a sizing sketch follows the list):
- 50M documents at full-precision HNSW: comfortable when memory-mapped (raw float32 vectors alone are ~150GB, so the index cannot be fully RAM-resident on 128GB)
- 100M documents at HNSW with 8-bit embedding quantization: feasible
- 500M documents require IVF-PQ or memory-mapped DiskANN; single-workstation territory ends here
- Beyond 500M documents: distributed vector DB (Milvus, Qdrant cluster) regardless of workstation memory
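The ceilings above follow from simple arithmetic: raw vector bytes plus a rough per-document allowance for HNSW graph links. The helper below is a back-of-envelope sizing sketch; the graph-overhead constant is an assumption, not a measured figure.

```python
# Back-of-envelope sizing for the ceilings above: raw float32 vector bytes plus
# a rough allowance for HNSW graph links. The graph-overhead constant is an assumption.
def hnsw_footprint_gb(n_docs, dim=768, bytes_per_component=4, hnsw_links=32):
    vector_bytes = dim * bytes_per_component       # 768 * 4 = 3,072 bytes at float32
    graph_bytes = hnsw_links * 2 * 4               # ~2*M level-0 links, 4-byte ids (rough)
    return n_docs * (vector_bytes + graph_bytes) / 1e9

for n in (1_000_000, 50_000_000, 100_000_000):
    print(f"{n:>11,} docs: ~{hnsw_footprint_gb(n):.0f} GB float32, "
          f"~{hnsw_footprint_gb(n, bytes_per_component=1):.0f} GB with 8-bit quantization")
```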
Privacy-Preserving Embedding Architectures
Three on-device patterns for sensitive RAG corpora in 2026:
- Fully local: embedder + vector DB + generator all on one device. Best privacy; latency-bounded by single-machine throughput.
- On-device retrieval, cloud generation: embeddings and the index stay local; only the retrieved chunks are sent to a cloud LLM. A compromise that gives up chunk privacy. A minimal sketch of this pattern follows the list.
- Federated retrieval: per-user local index; central LLM never sees user docs, only synthesised queries. Emerging pattern for personal AI.
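The sketch below illustrates the second pattern, assuming a local sentence-transformers embedder, a pre-built FAISS index on disk, and a hosted generation endpoint. The endpoint URL, file names, and response shape are placeholders, not a specific provider's API.

```python
# Second pattern: embeddings and the index never leave the device; only the
# question plus the retrieved chunks are sent to a hosted generator.
# Endpoint URL, file names, and response shape are placeholders.
import faiss
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-m3")     # runs locally
index = faiss.read_index("corpus_hnsw.faiss")     # local, pre-built index (placeholder file)
docs = open("chunks.txt", encoding="utf-8").read().split("\n---\n")

def ask(query, top_k=10):
    # Retrieval stays entirely on-device.
    q_vec = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    _, ids = index.search(q_vec, top_k)
    context = "\n".join(docs[i] for i in ids[0])

    # Only the question and the selected chunks cross the network boundary.
    resp = requests.post(
        "https://llm.example.internal/v1/generate",   # placeholder endpoint
        json={"prompt": f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"},
        timeout=60,
    )
    return resp.json()["text"]
```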
Brand Visibility Implications
On-device RAG is the AI surface where brand information lives in customer-controlled corpora. When an enterprise loads its sales collateral, support docs, and competitor analysis into a local RAG, the resulting AI answers about brands are shaped by what is or is not in that corpus. Brands not represented in customer-facing on-device RAG corpora are systematically under-recommended in AI flows powered by local RAG, regardless of cloud-API visibility. See the local LLM blind spot page for the operational implication.
Methodology
Retrieval latency from FAISS benchmarks and the LanceDB documentation. Embedding throughput from the BGE-M3 model card and community benchmarks. Generation throughput from our companion tokens-per-second benchmarks page. Real workloads diverge with chunk-size and reranker choices; treat these figures as guidance. Updated quarterly.
How Presenc AI Helps
Presenc AI's local-deployment instrumentation captures brand-mention rates in on-device RAG outputs: the surface where corpus content drives recommendations and where cloud-API observability is blind. For brands working to ensure presence in customer-controlled RAG corpora, this is the operational feedback loop.