Rerankers are the second stage of most production RAG systems: an embedding-based first-pass retrieval returns 50 to 200 candidates, then a cross-encoder reranker scores each candidate against the query and reorders them. Open-weight rerankers in 2026 are competitive with proprietary alternatives at meaningfully lower deployment cost. This page consolidates the leaderboard, the latency profile, and production-deployment guidance.
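The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `retrieve_then_rerank`, `overlap`, and `overlap_ratio` are hypothetical names, and the two toy scorers stand in for a real embedding model and a real cross-encoder.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Cheap scoring over the whole corpus, expensive scoring over a shortlist."""
    # Stage 1: rank every document with the cheap scorer and keep a shortlist.
    shortlist = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: score each (query, candidate) pair with the expensive scorer.
    rescored = [(doc, rerank_score(query, doc)) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Toy stand-ins (assumptions): word overlap for stage 1, overlap density for stage 2.
def overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    words = doc.lower().split()
    return overlap(query, doc) / len(words) if words else 0.0


docs = [
    "rerankers score query document pairs",
    "a long document that mentions rerankers and query handling among many other unrelated topics",
    "bananas are yellow",
]
results = retrieve_then_rerank(
    "rerankers query", docs, overlap, overlap_ratio, n_candidates=2, top_k=1
)
```

In a real stack the first-stage scorer would typically be a vector-index lookup rather than a full sort, and the second-stage scorer would batch pairs through the reranker model.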
Key Findings
- Qwen3-Reranker, released in mid-2025 alongside Qwen3-Embedding, tops the open-weight reranker leaderboard: the 8B variant reaches an average nDCG@10 of approximately 77 across the BEIR benchmarks.
- BGE-Reranker-v2-M3 remains the most-downloaded open-weight reranker on Hugging Face with cumulative downloads in the tens of millions; it pairs naturally with BGE-M3 as a two-stage RAG stack.
- Jina ColBERT v2 is the leading late-interaction model: instead of scoring with a single vector per text, ColBERT compares token-level vectors for the query and the document, which improves quality but raises storage and latency cost by roughly two to three times.
- The latency-quality tradeoff is sharp: ms-marco-MiniLM rerankers run in under 5 ms per pair on a single L40S GPU but average approximately 60 nDCG@10 on BEIR; the 9B BGE-Reranker-v2-Gemma takes approximately 42 ms per pair but averages approximately 73.7.
- Production deployment patterns: approximately 64 percent of surveyed production RAG systems use a reranker. The BGE-Reranker-v2 family is the most-deployed choice (about 28 percent), followed by Cohere Rerank 3 (proprietary API, about 16 percent) and the Qwen3-Reranker family (about 12 percent).
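The late-interaction scoring mentioned in the findings above can be illustrated with a small sketch of ColBERT-style MaxSim. The function names and the two-dimensional toy vectors are assumptions for illustration; a real model produces normalized token embeddings with many more dimensions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """For each query token, take its best match among document tokens, then sum."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Single-vector scoring for contrast: one similarity per (query, document)."""
    return dot(query_vec, doc_vec)


# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[1.0, 0.0]]                          # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

The sketch also makes the storage cost visible: `doc_a` keeps one vector per token, whereas single-vector scoring keeps one vector per document.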
Open-Weight Reranker Leaderboard (May 2026)
| Model | Parameters | BEIR Avg nDCG@10 | License |
|---|---|---|---|
| Qwen3-Reranker-8B | ~8B | ~77.0 | Tongyi Qianwen |
| Qwen3-Reranker-4B | ~4B | ~75.2 | Tongyi Qianwen |
| Qwen3-Reranker-0.6B | ~0.6B | ~71.4 | Tongyi Qianwen |
| BGE-Reranker-v2-M3 | ~0.6B | ~71.5 | MIT |
| BGE-Reranker-v2-Gemma | ~9B | ~73.7 | MIT |
| BGE-Reranker-v2-Minicpm-Layerwise | ~2.7B | ~73.2 | MIT |
| Jina ColBERT v2 | ~0.5B | ~70.1 | CC-BY-NC + Commercial |
| Jina Reranker v2 | ~0.3B | ~69.4 | CC-BY-NC + Commercial |
| mxbai-rerank-large-v1 | ~0.4B | ~67.3 | Apache 2.0 |
| ms-marco-MiniLM-L-12-v2 | ~33M | ~60.1 | Apache 2.0 |
| RankZephyr | ~7B | ~69.4 | MIT |
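For readers unfamiliar with the metric in the table, the following is a minimal sketch of nDCG@10 for a single query, using the linear-gain form. The function names are hypothetical, and the sketch assumes graded relevance labels are already available; leaderboard numbers are averages over many queries and datasets, reported on a 0 to 100 scale.

```python
import math
from typing import List, Optional


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain: relevance divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: List[float],
    k: int = 10,
    judged_relevances: Optional[List[float]] = None,
) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering.

    ranked_relevances: labels of the returned documents, in ranked order.
    judged_relevances: labels of all judged documents for the query; if omitted,
    the ideal ordering is built from the returned documents only (an assumption
    that holds only when every relevant document was retrieved).
    """
    pool = judged_relevances if judged_relevances is not None else ranked_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# A ranking that places a relevant document third instead of second.
example = ndcg_at_k([1, 0, 1])
```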
Latency Profile
| Model | Latency per Pair (L40S) | Throughput (pairs/sec) |
|---|---|---|
| ms-marco-MiniLM-L-12-v2 | ~4 ms | ~250 |
| mxbai-rerank-large-v1 | ~8 ms | ~125 |
| Jina Reranker v2 | ~6 ms | ~165 |
| BGE-Reranker-v2-M3 (0.6B) | ~12 ms | ~83 |
| Jina ColBERT v2 | ~10 ms (late-interaction) | ~100 |
| BGE-Reranker-v2-Gemma (9B) | ~42 ms | ~24 |
| Qwen3-Reranker-8B | ~38 ms | ~26 |
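The table's two columns are related by simple arithmetic, which the sketch below makes explicit. It assumes pairs are scored sequentially at the per-pair latency listed above (no overlap between batches); the helper names are hypothetical and the latency values are copied from the table.

```python
# Approximate per-pair latencies in milliseconds, copied from the table above.
LATENCY_MS_PER_PAIR = {
    "ms-marco-MiniLM-L-12-v2": 4,
    "mxbai-rerank-large-v1": 8,
    "Jina Reranker v2": 6,
    "BGE-Reranker-v2-M3": 12,
    "Jina ColBERT v2": 10,
    "BGE-Reranker-v2-Gemma": 42,
    "Qwen3-Reranker-8B": 38,
}


def throughput_pairs_per_sec(latency_ms: float) -> float:
    """Throughput as the reciprocal of per-pair latency."""
    return 1000.0 / latency_ms


def rerank_latency_seconds(model: str, n_candidates: int) -> float:
    """Added latency for reranking n_candidates pairs one after another."""
    return LATENCY_MS_PER_PAIR[model] * n_candidates / 1000.0


def max_candidates_within(model: str, budget_ms: float) -> int:
    """Largest candidate count that fits inside a latency budget."""
    return int(budget_ms // LATENCY_MS_PER_PAIR[model])


added = rerank_latency_seconds("BGE-Reranker-v2-M3", 100)
fits = max_candidates_within("ms-marco-MiniLM-L-12-v2", 200)
```

Under these assumptions, a 200 ms budget leaves room for about 50 candidates on the smallest model, which is one way to size the first-stage shortlist for latency-sensitive flows.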
Production Deployment Patterns
| Reranker Choice | Share of Production RAG Deployments (Surveyed) |
|---|---|
| BGE-Reranker-v2 family | ~28% |
| Cohere Rerank 3 (proprietary API) | ~16% |
| Qwen3-Reranker family | ~12% |
| Jina Reranker / ColBERT | ~8% |
| ms-marco-MiniLM (legacy) | ~7% |
| Voyage AI rerank-2 (proprietary API) | ~6% |
| OpenAI LLM used as reranker (top-k selection) | ~5% |
| No reranker (single-stage) | ~36% |
When to Use a Reranker
Three rules of thumb. First, use a reranker if your retrieval recall at top-100 is materially above your recall at top-5 on a representative eval set; the reranker exists to close that gap by moving relevant documents from deep in the candidate list into the top positions. Second, the latency cost is approximately additive on top of LLM generation: a 100-pair rerank on BGE-Reranker-v2-M3 adds approximately 1.2 seconds (100 pairs at roughly 12 ms each), which is acceptable in most RAG flows but problematic in voice or real-time chat. Third, smaller rerankers (ms-marco-MiniLM, mxbai-rerank-large-v1) are often sufficient if the embedding first stage is high-quality; reranker quality matters more when first-stage retrieval is noisy.
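The first rule can be checked with a short script over an eval set. This is a sketch under stated assumptions: `runs` maps each query ID to its first-stage ranking, `qrels` maps each query ID to its set of relevant document IDs, and the 0.10 threshold in `reranker_worthwhile` is a hypothetical cutoff, not a value from the survey or benchmarks.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_gap(
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    small_k: int = 5,
    large_k: int = 100,
) -> float:
    """Average of recall@large_k minus recall@small_k over the eval queries."""
    gaps = [
        recall_at_k(runs[q], qrels[q], large_k) - recall_at_k(runs[q], qrels[q], small_k)
        for q in qrels
        if q in runs
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0


def reranker_worthwhile(gap: float, threshold: float = 0.10) -> bool:
    """Hypothetical decision rule: a reranker can only recover what the gap contains."""
    return gap >= threshold


# Toy eval set: q1's relevant document sits at rank 6, q2's at rank 1.
runs = {"q1": ["d9", "d8", "d7", "d6", "d5", "d1"], "q2": ["d2", "d3"]}
qrels = {"q1": {"d1"}, "q2": {"d2"}}
gap = mean_recall_gap(runs, qrels)
```

A gap near zero suggests the first stage already surfaces relevant documents in the top positions, so a reranker has little left to recover.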
Brand Visibility Implications
Reranker selection is a high-traffic engineering procurement category. AI assistant queries about "best reranker for RAG", "BGE-Reranker vs Cohere Rerank", "open-source reranker 2026", and similar terms drive direct production decisions. Brands selling RAG infrastructure, vector databases, hybrid search, and AI evaluation tools are therefore heavily exposed to AI-mediated discovery in this category.
Methodology
Benchmark scores are compiled from the BEIR benchmark, the MTEB reranking subset, and primary model-card disclosures. Latency benchmarks were run on a single L40S GPU at batch size 32 with a 512-token average context length. Production deployment shares come from cross-industry survey data through Q1 2026. This page is updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on reranker selection queries across ChatGPT, Claude, Gemini, and Perplexity. For RAG infrastructure vendors, vector database providers, and hybrid search platforms, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.