LLM Context Window Race 2023-2026

How frontier LLM context windows expanded from 4K to 10M tokens between 2023 and 2026, vendor by vendor. Includes claimed-vs-effective context window data from the RULER benchmark and brand-visibility implications for long-context retrieval.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

From 4K to 10M Tokens in 36 Months

Between early 2023 and May 2026, frontier LLM context windows expanded roughly 1,000x. The default flagship context window moved from 4K-8K tokens to 1M-2M tokens. Llama 4 Scout pushed the public ceiling to 10M tokens; Magic.dev demonstrated 100M in lab settings. This page tracks the full timeline and pairs claimed numbers against effective-context-window scores from the NVIDIA RULER benchmark, the most-cited measurement of whether a model can actually reason over its claimed window.

Context Window Timeline, Selected Frontier Launches

| Date | Model | Vendor | Context (tokens) |
| --- | --- | --- | --- |
| Nov 2022 | GPT-3.5 | OpenAI | 4K (later 16K) |
| Mar 2023 | GPT-4 | OpenAI | 8K / 32K |
| Mar 2023 | Claude 1 | Anthropic | 9K |
| Jul 2023 | Claude 2 | Anthropic | 100K |
| Nov 2023 | GPT-4 Turbo | OpenAI | 128K |
| Nov 2023 | Claude 2.1 | Anthropic | 200K |
| Feb 2024 | Gemini 1.5 Pro | Google | 1M |
| Apr 2024 | Gemini 1.5 Pro (expanded) | Google | 2M |
| Apr 2024 | Llama 3 | Meta | 8K |
| May 2024 | GPT-4o | OpenAI | 128K |
| Jul 2024 | Llama 3.1 | Meta | 128K |
| Mar 2025 | Gemini 2.5 Pro | Google | 1M |
| Apr 2025 | Llama 4 Scout | Meta | 10M |
| May 2025 | Claude Sonnet 4 | Anthropic | 1M (beta) |
| Feb 2026 | Gemini 3.1 Pro | Google | 1M |
| Mar 2026 | Claude Opus 4.6 / Sonnet 4.6 GA | Anthropic | 1M (flat rate) |
| Apr 2026 | DeepSeek-V4 | DeepSeek | 1M |
| 2026 | grok-4.20 | xAI | 2M |

Current State (May 2026), by Vendor

| Model | Vendor | Claimed Context | Output Cap | Pricing Note |
| --- | --- | --- | --- | --- |
| Llama 4 Scout | Meta | 10M | ~32K | Hosted by Groq, Together, etc. |
| grok-4.20 | xAI | 2M | ~32K | Flat input rate |
| Gemini 2.5 Pro | Google | 1M | ~65K | 2x surcharge above 200K input |
| Gemini 3.1 Pro | Google | 1M | 65K | 2x surcharge above 200K input |
| Claude Opus 4.7 | Anthropic | 200K | ~32K | 1M tier requires preview access |
| Claude Sonnet 4.6 | Anthropic | 1M | ~64K | Flat rate, no surcharge |
| Claude Haiku 4.5 | Anthropic | 200K | ~32K | Flat rate |
| DeepSeek-V4-Flash / Pro | DeepSeek | 1M | 384K | Cache-hit discount up to 99 percent |
| GPT-5.5 | OpenAI | ~270K | ~128K | Long-context rates apply above 270K |
| Mistral Large 3 / Medium 3.5 | Mistral | 128K | ~32K | Flat rate |
| Cohere Command R+ | Cohere | 128K | ~4K | Flat rate |

Claimed vs Effective: RULER Benchmark Findings

The headline number is not the operating number. NVIDIA's RULER benchmark measures actual retrieval, multi-hop tracing, and aggregation across the full claimed window. Findings published through 2026 indicate that effective context (the length over which a model retains its short-context accuracy) is typically 50-65 percent of the advertised number.

| Model | RULER Score at 4K | RULER Score at 128K | Drop |
| --- | --- | --- | --- |
| Gemini 1.5 Pro | ~96 | ~94 | ~2 points (best-in-class retention) |
| GPT-4-1106 | 96.6 | 81.2 | ~15 points |
| Llama 3.1-70B | 96.5 | 66.6 | ~30 points |

Pattern: Google's Gemini family retains performance across long context dramatically better than non-Google models. The 1M-token claim is closer to literally true for Gemini than for any other vendor. Llama 3.1-70B at 128K retains roughly 70 percent of its 4K accuracy (66.6 vs 96.5), which means content in the second half of a 128K prompt is largely lost.
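As a planning heuristic (an illustration derived from the 50-65 percent retention figure above, not part of RULER itself), a claimed window can be discounted to a working budget like this:

```python
def effective_context_budget(claimed_tokens: int, retention: float = 0.5) -> int:
    """Estimate a usable context budget from an advertised window.

    `retention` is the fraction of the claimed window over which the
    model holds its short-context accuracy; RULER-style findings cited
    above suggest roughly 0.5-0.65 for most non-Gemini models.
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    return int(claimed_tokens * retention)


# A 1M-token claim at 50 percent retention leaves ~500K usable tokens.
budget = effective_context_budget(1_000_000, retention=0.5)
print(budget)  # 500000
```

The retention default is a conservative assumption; swap in a model-specific RULER score where one is published.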

Five Things the Race Tells You

  1. The race is over for advertised capacity. 1M tokens is now table stakes for Google, Anthropic, xAI, and DeepSeek. The 2M-10M outliers (Grok 4.20, Llama 4 Scout) are spec-sheet wins, not workload reality.
  2. Effective context lags claimed context by 30-60 percent. RULER and similar benchmarks show meaningful degradation past 32K-64K tokens for most non-Gemini models. Plan for the effective number, not the marketing number.
  3. Pricing discontinuities cluster at 200K input. Google applies a roughly 2x surcharge above 200K input on Gemini 2.5 Pro and Gemini 3.1 Pro. OpenAI applies long-context rates above ~270K. Anthropic Sonnet 4.6 is the only flagship with flat 1M pricing, which removes that planning step.
  4. Output windows lag input windows by 10-30x. A 1M input window paired with a 32K output cap means the model can read a novel but only write a chapter. Long-context workloads are read-heavy by design.
  5. Cache-hit pricing changes the economics more than context length does. DeepSeek-V4-Flash cached input is $0.0028 per million tokens, roughly 50x cheaper than uncached. For RAG and document-QA workloads on stable corpora, cache pricing matters more than raw context window.
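Points 3 and 5 can be made concrete with two small cost helpers. The 200K threshold, 2x multiplier, and the $0.0028-per-million cached rate come from the figures above; the $1.25-per-million base rate used in the example is purely hypothetical.

```python
def tiered_input_cost(tokens: int, rate_per_m: float,
                      threshold: int = 200_000,
                      multiplier: float = 2.0) -> float:
    """Input cost in USD when tokens above `threshold` bill at
    `multiplier` x the base per-million rate (the Gemini-style tier)."""
    billable = min(tokens, threshold) + max(tokens - threshold, 0) * multiplier
    return billable * rate_per_m / 1_000_000


def cached_input_cost(tokens: int, hit_rate: float,
                      rate_per_m: float, cached_rate_per_m: float) -> float:
    """Blended input cost in USD given a prompt-cache hit rate."""
    hits = tokens * hit_rate
    misses = tokens - hits
    return (hits * cached_rate_per_m + misses * rate_per_m) / 1_000_000


# An 800K prompt at a hypothetical $1.25/M base rate: $1.75 with the
# 200K/2x tier versus $1.00 flat -- the discontinuity is a 75% premium.
tiered = tiered_input_cost(800_000, 1.25)   # 1.75
flat = 800_000 * 1.25 / 1_000_000           # 1.00
```

On a stable corpus with a 90 percent cache hit rate, `cached_input_cost` shows the blended rate collapsing toward the cached price, which is why point 5 dominates for RAG workloads.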

What This Means for AI Visibility and Brand-Recommendation Workloads

Long-context models are increasingly the substrate for agentic brand recommendations: a retrieval system loads vendor documentation, customer reviews, and competitive material into the prompt, then the model picks. Two things matter for brand visibility in this pipeline. First, the effective-context-window number determines whether a brand's information at token 800K of a 1M prompt actually influences the recommendation, or gets ignored due to long-context decay. Second, models with strong RULER retention (Gemini family today) preserve mid-context information that weaker long-context models drop, so a brand's placement in the retrieval order matters less. Brands optimising AI visibility should monitor both the headline context and the published effective-context measurements when prioritising platform coverage.

Methodology

Timeline data assembled from vendor model cards and announcement posts. Current-state context windows pulled from vendor pricing and developer documentation on May 14, 2026: OpenAI, Anthropic, Google, DeepSeek, xAI, Mistral. RULER scores from the NVIDIA RULER repo (github.com/NVIDIA/RULER) and the originating paper (arxiv:2404.06654). Output caps reflect documented maximum-output-tokens values; some vendors throttle below the documented ceiling under load. Refreshed quarterly.

How Presenc AI Helps

Presenc AI tracks how brand mentions surface across short-context flagship calls (the consumer-visible answer) and long-context retrieval pipelines (the agent doing the comparison). The same brand often ranks differently in a 4K direct prompt than in an 800K retrieval prompt, and the gap is where most AI-visibility programmes lose attribution. For brands building a multi-tier visibility strategy, this is the signal that connects context-window mechanics to recommendation outcomes.

Frequently Asked Questions

Which model has the largest context window?
Llama 4 Scout, at 10M tokens, has the largest advertised context window among publicly released frontier models. Grok 4.20 follows at 2M. Among the major flagship-tier proprietary models, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro, and DeepSeek-V4 all support 1M. Magic.dev demonstrated a 100M-token model (LTM-2) in lab settings; it is not generally available.
Do models actually use their full claimed context?
No. The NVIDIA RULER benchmark measures retrieval, multi-hop tracing, and aggregation across the full claimed window and consistently finds 30-60 percent degradation above 32K-64K tokens for most non-Gemini models. Effective context (the length over which a model retains its short-context accuracy) is typically 50-65 percent of the advertised number. Gemini 1.5 Pro is the standout outlier, retaining performance to roughly the full claim.
Can a 1M-token model really read a whole novel?
Approximately yes, in raw tokens. 1M tokens is roughly 750,000 English words, comfortably larger than most novels. But the question that matters operationally is whether the model can reason over the entire novel; for most non-Gemini frontier models at the time of writing, the second half of a 1M-token prompt is partially lost to long-context decay. Test with retrieval needles in your actual prompt distribution.
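A needle test can be sketched as follows. Everything here is illustrative: the needle string, the filler, and the `model` callable (a stand-in for a real provider SDK call) are all hypothetical.

```python
def needle_trial(model, filler_tokens, needle, depth):
    """Insert `needle` at fractional `depth` (0.0-1.0) of a filler
    document, then check whether `model` (any callable prompt -> str)
    can retrieve it from that position."""
    pos = int(len(filler_tokens) * depth)
    doc = filler_tokens[:pos] + [needle] + filler_tokens[pos:]
    prompt = " ".join(doc) + "\nRepeat the code word hidden in the text above."
    return needle in model(prompt)


# Dummy "model" that succeeds whenever the needle survived into the
# prompt; replace with a real API client to test an actual provider.
def dummy(prompt):
    return "the code word is XYZZY-42" if "XYZZY-42" in prompt else "not found"


# Sweep depths: a real long-context model may fail at deep positions
# even when the claimed window covers the whole prompt.
results = {d: needle_trial(dummy, ["lorem"] * 2_000, "XYZZY-42", d)
           for d in (0.1, 0.5, 0.9)}
```

Running the sweep at several prompt lengths approximates a per-model effective-context curve for your own corpus, rather than relying on published benchmark averages.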
Why are output caps so much smaller than input windows?
Output tokens are generated autoregressively (one at a time, each conditioned on all previous), so output cost and latency scale linearly with output length. Input tokens process in parallel during prefill, which is dramatically cheaper. Vendors size output caps to keep tail latency and per-request cost manageable. Long-context workloads are read-heavy by design.
Is the context window race over?
The advertised-context curve is flattening as the major flagships converge at 1M. Outliers (Grok 4.20 at 2M, Llama 4 Scout at 10M) are spec-sheet wins, not operational changes. The active research frontier has moved to effective-context retention (RULER scores) and to cache pricing economics, both of which determine how usable a long context actually is.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.