What is the longest-context open-weight model?

Llama 4 Scout, with a claimed 10 million token context window. Its effective high-quality context on the RULER benchmark is closer to 256K. Llama 4 Maverick and MiniMax M1 (both claiming 1M) are alternatives in the 1M class, with effective context of roughly 256K to 512K.

Can I trust the headline context window claims?

Cautiously. Headline context windows reflect maximum input length, not quality across that window. RULER and Needle-in-a-Haystack benchmarks measure effective context: the length up to which retrieval and reasoning still work. Most models with 1M+ claims show measurable quality degradation beyond approximately 128K to 256K of effective context.

How much VRAM do I need for 1M token inference?

A Llama 3.1 70B-class model at 1M tokens in FP16 needs approximately 460 GB of VRAM in total: roughly 140 GB for the weights plus approximately 320 GB for the KV cache. Hybrid architectures (Jamba, Mamba 2) keep a KV cache for only a fraction of their layers, so the cache grows far more slowly with context and the requirement drops substantially. Quantizing the KV cache can shrink it by a further 2x (FP8) to 4x (INT4).

When should I use long-context vs RAG?

Long-context is better when retrieval boundaries are unknown, for code refactoring across an entire repo, or for long-document synthesis where chunking introduces context loss. RAG is more cost-effective when retrieval can target specific document sections. The 2026 pattern increasingly mixes both: RAG for first-pass retrieval, long-context for downstream reasoning over the retrieved set.
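The mixed pattern can be sketched as a two-stage pipeline: a cheap retriever ranks chunks, then as many of the top-ranked chunks as fit in a token budget are packed into one long prompt. The sketch below is a toy under stated assumptions: keyword overlap stands in for a real retriever, whitespace word counts stand in for a tokenizer, and all function names are hypothetical.

```python
def score(query, chunk):
    """Toy relevance score: count of distinct query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)


def build_long_context(query, chunks, token_budget):
    """First-pass retrieval, then pack the best chunks into a long-context prompt.

    Chunks are ranked by the toy score; zero-score chunks are dropped, and the
    rest are added greedily while they fit in token_budget (word count is an
    assumed stand-in for real token counts). Selected chunks are returned in
    their original document order so the model sees them in sequence.
    """
    ranked = sorted(range(len(chunks)),
                    key=lambda i: score(query, chunks[i]),
                    reverse=True)
    chosen, used = [], 0
    for i in ranked:
        cost = len(chunks[i].split())
        if score(query, chunks[i]) == 0 or used + cost > token_budget:
            continue
        chosen.append(i)
        used += cost
    return [chunks[i] for i in sorted(chosen)]


docs = [
    "the cache stores keys and values",
    "unrelated text about cooking",
    "keys are reused at decode time",
]
selected = build_long_context("how are keys reused", docs, token_budget=100)
print(selected)
```

With a generous budget, every relevant chunk reaches the model; with a tight one, only the highest-ranked chunks do, which is where the cost trade-off between the two approaches shows up.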

Which hybrid models are best for long-context?

Jamba 1.5 Large (Transformer + Mamba MoE hybrid) is the most-deployed production long-context model in the 256K class, and its memory use grows far more slowly with context than a pure Transformer's. Mamba 2 hybrids reach 1M+ tokens with similar memory advantages. The hybrid approach is the dominant architectural choice for long-context-focused models.

Open-Weight Long-Context Models 2026

Open-weight long-context LLMs are firmly in the 1 million plus token regime as of 2026. Llama 4 Scout shipped with a claimed 10 million token context window. Qwen3 long-context variants reach 1 million tokens. MiniMax M1 (open weight) reaches 1 million tokens. Jamba 1.5 Large remains a strong production choice at 256K. Retrieval and reasoning quality across the full window remains the deciding evaluation metric. This page consolidates the long-context model landscape.

Key Findings

Llama 4 Scout (109B MoE / 17B active) ships with a claimed 10 million token context window, the largest in the open-weight ecosystem as of May 2026.
MiniMax M1 (456B MoE / ~45B active), released open-weight in 2025, reaches 1 million tokens with strong recall and reasoning at long range.
Qwen3 long-context variants and Qwen2.5 1M Context both reach 1 million tokens, with quality degradation comparable to that of frontier closed alternatives.
Jamba 1.5 Large (Transformer + Mamba MoE hybrid) reaches 256K tokens with much slower memory growth than a pure Transformer, making it the most-deployed long-context model in production.
Long-context quality is uneven: most models that claim 1M+ token contexts show measurable degradation in retrieval and reasoning quality beyond 128K to 256K of effective context.

Open-Weight Long-Context Models (May 2026)

Model	Context Window	Architecture	License
Llama 4 Scout	10M tokens	MoE Transformer (109B/17B active)	Llama 4 Community
Llama 4 Maverick	1M tokens	MoE Transformer (400B/17B active)	Llama 4 Community
Qwen2.5 1M Context	1M tokens	Transformer	Apache 2.0 / Tongyi
Qwen3 long-context variants	1M tokens	Transformer	Apache 2.0 / Tongyi
MiniMax M1	1M tokens	Lightning Attention + MoE (456B/~45B)	Apache 2.0
Jamba 1.5 Large	256K tokens	Transformer + Mamba MoE hybrid	Jamba Community License
Jamba 1.5 Mini	256K tokens	Transformer + Mamba MoE hybrid	Jamba Community License
Mistral Large 3	256K tokens	Transformer	Mistral Research / Commercial
GLM-4-9B-1M / GLM-4.5-9B-1M	1M tokens	Transformer	MIT / GLM License
InternLM-2.5-7B-1M	1M tokens	Transformer	Apache 2.0
Yi-200K family	200K tokens	Transformer	Apache 2.0
DeepSeek V3 / V4	128K tokens	MoE Transformer	MIT
Mamba 2 (Hybrid)	1M+ tokens	State-space + attention hybrid	Apache 2.0

Long-Context Quality (RULER Benchmark at Effective Context)

Model	Effective Context (RULER score above 85)
Llama 4 Scout	~256K tokens (despite 10M claim)
Llama 4 Maverick	~512K tokens
Qwen2.5 1M Context	~128K tokens
MiniMax M1	~256K tokens
Jamba 1.5 Large	~140K tokens (within 256K claim)
Mistral Large 3	~128K tokens
Mamba 2 Hybrid	~256K tokens
Gemini 2.5 Pro (closed, for reference)	~1M+ tokens (~95 RULER)
GPT-5.5 (closed, for reference)	~256K tokens
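
One way to read "effective context" as used in this table is the longest tested length at which the RULER score is above the threshold (85 here), with every shorter tested length also passing. A minimal sketch of that rule follows. The scores in the example are made-up placeholders for a hypothetical model, not measured results, and the function name is illustrative.

```python
def effective_context(scores_by_length, threshold=85.0):
    """Return the longest context length that still scores above threshold.

    Lengths are checked from shortest to longest, and the scan stops at the
    first failure, so a model that dips below the threshold and later
    recovers is not credited for the longer lengths. Returns 0 if even the
    shortest tested length fails.
    """
    effective = 0
    for length in sorted(scores_by_length):
        if scores_by_length[length] <= threshold:
            break
        effective = length
    return effective


# Placeholder scores (context length in tokens -> RULER-style score).
example_scores = {
    4_000: 96.1,
    32_000: 93.4,
    128_000: 89.0,
    256_000: 86.2,
    512_000: 78.5,
    1_000_000: 61.0,
}
print(effective_context(example_scores))
```

Under this reading, a model can accept 1M tokens of input and still have an effective context of 256K, which is the gap between claimed and effective figures that the table reports.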

Hardware Requirements for Long-Context Inference

Context Length	VRAM Requirement (FP16, Llama 3.1 70B-class)
32K	~150 GB (KV cache ~10 GB)
128K	~180 GB (KV cache ~40 GB)
256K	~220 GB (KV cache ~80 GB)
512K	~300 GB (KV cache ~160 GB)
1M	~460 GB (KV cache ~320 GB)
10M (Llama 4 Scout)	Multi-node required

Hybrid attention (Jamba) and state-space (Mamba 2) architectures keep a KV cache for only a fraction of their layers, so cache size grows much more slowly with context and long-context VRAM requirements drop materially.
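
The KV cache figures in the table can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the published Llama 3.1 70B shape (80 layers, 8 grouped-query KV heads, head dimension 128) and 2 bytes per value for FP16. The function name is illustrative, token counts use binary units (1M = 1,048,576 tokens), and the result covers the cache only, not weights, activations, or runtime overhead.

```python
GIB = 1024 ** 3


def kv_cache_gib(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2.0):
    """Estimate KV cache size in GiB for a single sequence of `tokens` tokens.

    Each token stores one key and one value vector (hence the factor of 2)
    per KV head per layer. The defaults are the assumed Llama 3.1 70B shape
    with FP16 values.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / GIB


for label, tokens in [("32K", 32 * 1024), ("128K", 128 * 1024),
                      ("256K", 256 * 1024), ("512K", 512 * 1024),
                      ("1M", 1024 * 1024)]:
    fp16 = kv_cache_gib(tokens)
    fp8 = kv_cache_gib(tokens, bytes_per_value=1.0)
    int4 = kv_cache_gib(tokens, bytes_per_value=0.5)
    print(f"{label:>5}: FP16 {fp16:6.1f} GiB | FP8 {fp8:6.1f} GiB | INT4 {int4:6.1f} GiB")
```

Adding roughly 140 GB of FP16 weights (70B parameters at 2 bytes each) to the FP16 cache column gives the totals in the table. The FP8 and INT4 columns show the 2x and 4x reductions from KV cache quantization.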

Brand Visibility Implications

Long-context AI is a fast-growing procurement category. AI assistant queries about "1 million token LLM", "long-context AI", "open-source long context", and similar terms feed directly into production decisions for codebase analysis, long-document understanding, and agentic workloads. Brands selling AI infrastructure, long-document processing, and agentic platforms face a large AI-mediated discovery surface in this category.

Methodology

Benchmark data is compiled from RULER long-context evaluations, primary model card disclosures, and long-context-specific benchmark publications through 23 May 2026. Updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on long-context AI queries across ChatGPT, Claude, Gemini, and Perplexity. For AI infrastructure brands, long-document processing vendors, and agentic platform firms, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.