LLM Context Window Race 2023-2026

How frontier LLM context windows expanded from 4K to 10M tokens between 2023 and 2026, vendor by vendor. Includes claimed-vs-effective context window data from the RULER benchmark and brand-visibility implications for long-context retrieval.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

From 4K to 10M Tokens in 36 Months

Between early 2023 and May 2026, frontier LLM context windows expanded roughly 1,000x. The default flagship context window moved from 4K-8K tokens to 1M-2M tokens. Llama 4 Scout pushed the public ceiling to 10M tokens; Magic.dev demonstrated 100M in lab settings. This page tracks the full timeline and pairs claimed numbers against effective-context-window scores from the NVIDIA RULER benchmark, the most-cited measurement of whether a model can actually reason over its claimed window.

Context Window Timeline, Selected Frontier Launches

| Date | Model | Vendor | Context (tokens) |
| --- | --- | --- | --- |
| Nov 2022 | GPT-3.5 | OpenAI | 4K (later 16K) |
| Mar 2023 | GPT-4 | OpenAI | 8K / 32K |
| Mar 2023 | Claude 1 | Anthropic | 9K |
| Jul 2023 | Claude 2 | Anthropic | 100K |
| Nov 2023 | GPT-4 Turbo | OpenAI | 128K |
| Nov 2023 | Claude 2.1 | Anthropic | 200K |
| Feb 2024 | Gemini 1.5 Pro | Google | 1M |
| Apr 2024 | Gemini 1.5 Pro (expanded) | Google | 2M |
| Apr 2024 | Llama 3 | Meta | 8K |
| May 2024 | GPT-4o | OpenAI | 128K |
| Jul 2024 | Llama 3.1 | Meta | 128K |
| Mar 2025 | Gemini 2.5 Pro | Google | 1M |
| Apr 2025 | Llama 4 Scout | Meta | 10M |
| May 2025 | Claude Sonnet 4 | Anthropic | 1M (beta) |
| Feb 2026 | Gemini 3.1 Pro | Google | 1M |
| Mar 2026 | Claude Opus 4.6 / Sonnet 4.6 GA | Anthropic | 1M (flat rate) |
| Apr 2026 | DeepSeek-V4 | DeepSeek | 1M |
| 2026 | grok-4.20 | xAI | 2M |

Current State (May 2026), by Vendor

| Model | Vendor | Claimed Context | Output Cap | Pricing Note |
| --- | --- | --- | --- | --- |
| Llama 4 Scout | Meta | 10M | ~32K | Hosted by Groq, Together, etc. |
| grok-4.20 | xAI | 2M | ~32K | Flat input rate |
| Gemini 2.5 Pro | Google | 1M | ~65K | 2x surcharge above 200K input |
| Gemini 3.1 Pro | Google | 1M | 65K | 2x surcharge above 200K input |
| Claude Opus 4.7 | Anthropic | 200K | ~32K | 1M tier requires preview access |
| Claude Sonnet 4.6 | Anthropic | 1M | ~64K | Flat rate, no surcharge |
| Claude Haiku 4.5 | Anthropic | 200K | ~32K | Flat rate |
| DeepSeek-V4-Flash / Pro | DeepSeek | 1M | 384K | Cache-hit discount up to 99 percent |
| GPT-5.5 | OpenAI | ~270K | ~128K | Long-context rates apply above 270K |
| Mistral Large 3 / Medium 3.5 | Mistral | 128K | ~32K | Flat rate |
| Cohere Command R+ | Cohere | 128K | ~4K | Flat rate |

Claimed vs Effective: RULER Benchmark Findings

The headline number is not the operating number. NVIDIA's RULER benchmark measures actual retrieval, multi-hop tracing, and aggregation across the full claimed window. Findings published through 2026 indicate that effective context (the length over which a model retains its short-context accuracy) is typically 50-65 percent of the advertised number.

| Model | RULER Score at 4K | RULER Score at 128K | Drop |
| --- | --- | --- | --- |
| Gemini 1.5 Pro | ~96 | ~94 | ~2 points (best-in-class retention) |
| GPT-4-1106 | 96.6 | 81.2 | ~15 points |
| Llama 3.1-70B | 96.5 | 66.6 | ~30 points |

Pattern: Google's Gemini family retains performance across long context dramatically better than non-Google models. The 1M-token claim is closer to literally true for Gemini than for any other vendor. Llama 3.1-70B at 128K retains roughly 70 percent of its 4K accuracy (66.6 vs 96.5), which means content in the second half of a 128K prompt is largely lost.
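As a planning heuristic (an illustration derived from the 50-65 percent retention figure above, not part of RULER itself), a claimed window can be discounted to a working budget like this:

```python
def effective_context_budget(claimed_tokens: int, retention: float = 0.5) -> int:
    """Estimate a usable context budget from an advertised window.

    `retention` is the fraction of the claimed window over which the
    model holds its short-context accuracy; RULER-style findings cited
    above suggest roughly 0.5-0.65 for most non-Gemini models.
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    return int(claimed_tokens * retention)


# A 1M-token claim at 50 percent retention leaves ~500K usable tokens.
budget = effective_context_budget(1_000_000, retention=0.5)
print(budget)  # 500000
```

The retention default is a conservative assumption; swap in a model-specific RULER score where one is published.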

Five Things the Race Tells You

  1. The race is over for advertised capacity. 1M tokens is now table stakes for Google, Anthropic, xAI, and DeepSeek. The 2M-10M outliers (Grok 4.20, Llama 4 Scout) are spec-sheet wins, not workload reality.
  2. Effective context lags claimed context by 30-60 percent. RULER and similar benchmarks show meaningful degradation past 32K-64K tokens for most non-Gemini models. Plan for the effective number, not the marketing number.
  3. Pricing discontinuities cluster at 200K input. Google applies a roughly 2x surcharge above 200K input on Gemini 2.5 Pro and Gemini 3.1 Pro. OpenAI applies long-context rates above ~270K. Anthropic Sonnet 4.6 is the only flagship with flat 1M pricing, which removes that planning step.
  4. Output windows lag input windows by 10-30x. A 1M input window paired with a 32K output cap means the model can read a novel but only write a chapter. Long-context workloads are read-heavy by design.
  5. Cache-hit pricing changes the economics more than context length does. DeepSeek-V4-Flash cached input is $0.0028 per million tokens, roughly 50x cheaper than uncached. For RAG and document-QA workloads on stable corpora, cache pricing matters more than raw context window.
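Points 3 and 5 can be made concrete with two small cost helpers. The 200K threshold, 2x multiplier, and the $0.0028-per-million cached rate come from the figures above; the $1.25-per-million base rate used in the example is purely hypothetical.

```python
def tiered_input_cost(tokens: int, rate_per_m: float,
                      threshold: int = 200_000,
                      multiplier: float = 2.0) -> float:
    """Input cost in USD when tokens above `threshold` bill at
    `multiplier` x the base per-million rate (the Gemini-style tier)."""
    billable = min(tokens, threshold) + max(tokens - threshold, 0) * multiplier
    return billable * rate_per_m / 1_000_000


def cached_input_cost(tokens: int, hit_rate: float,
                      rate_per_m: float, cached_rate_per_m: float) -> float:
    """Blended input cost in USD given a prompt-cache hit rate."""
    hits = tokens * hit_rate
    misses = tokens - hits
    return (hits * cached_rate_per_m + misses * rate_per_m) / 1_000_000


# An 800K prompt at a hypothetical $1.25/M base rate: $1.75 with the
# 200K/2x tier versus $1.00 flat -- the discontinuity is a 75% premium.
tiered = tiered_input_cost(800_000, 1.25)   # 1.75
flat = 800_000 * 1.25 / 1_000_000           # 1.00
```

On a stable corpus with a 90 percent cache hit rate, `cached_input_cost` shows the blended rate collapsing toward the cached price, which is why point 5 dominates for RAG workloads.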

What This Means for AI Visibility and Brand-Recommendation Workloads

Long-context models are increasingly the substrate for agentic brand recommendations: a retrieval system loads vendor documentation, customer reviews, and competitive material into the prompt, then the model picks. Two things matter for brand visibility in this pipeline. First, the effective-context-window number determines whether a brand's information at token 800K of a 1M prompt actually influences the recommendation, or gets ignored due to long-context decay. Second, models with strong RULER retention (Gemini family today) preserve mid-context information that weaker long-context models drop, so a brand's placement in the retrieval order matters less. Brands optimising AI visibility should monitor both the headline context and the published effective-context measurements when prioritising platform coverage.

Methodology

Timeline data assembled from vendor model cards and announcement posts. Current-state context windows pulled from vendor pricing and developer documentation on May 14, 2026: OpenAI, Anthropic, Google, DeepSeek, xAI, Mistral. RULER scores from the NVIDIA RULER repo (github.com/NVIDIA/RULER) and the originating paper (arxiv:2404.06654). Output caps reflect documented maximum-output-tokens values; some vendors throttle below the documented ceiling under load. Refreshed quarterly.

How Presenc AI Helps

Presenc AI tracks how brand mentions surface across short-context flagship calls (the consumer-visible answer) and long-context retrieval pipelines (the agent doing the comparison). The same brand often ranks differently in a 4K direct prompt than in an 800K retrieval prompt, and the gap is where most AI-visibility programmes lose attribution. For brands building a multi-tier visibility strategy, this is the signal that connects context-window mechanics to recommendation outcomes.

Frequently Asked Questions

Which model has the largest context window?
Llama 4 Scout, at 10M tokens, has the largest advertised context window among publicly released frontier models. Grok 4.20 follows at 2M. Among the major flagship-tier proprietary models, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro, and DeepSeek-V4 all support 1M. Magic.dev demonstrated a 100M-token model (LTM-2) in lab settings; it is not generally available.
Do models actually use their full claimed context?
No. The NVIDIA RULER benchmark measures retrieval, multi-hop tracing, and aggregation across the full claimed window and consistently finds 30-60 percent degradation above 32K-64K tokens for most non-Gemini models. Effective context (the length over which a model retains its short-context accuracy) is typically 50-65 percent of the advertised number. Gemini 1.5 Pro is the standout outlier, retaining performance to roughly the full claim.
Can a 1M-token model really read a whole novel?
Approximately yes, in raw tokens. 1M tokens is roughly 750,000 English words, comfortably larger than most novels. But the question that matters operationally is whether the model can reason over the entire novel; for most non-Gemini frontier models at the time of writing, the second half of a 1M-token prompt is partially lost to long-context decay. Test with retrieval needles in your actual prompt distribution.
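A needle test can be sketched as follows. Everything here is illustrative: the needle string, the filler, and the `model` callable (a stand-in for a real provider SDK call) are all hypothetical.

```python
def needle_trial(model, filler_tokens, needle, depth):
    """Insert `needle` at fractional `depth` (0.0-1.0) of a filler
    document, then check whether `model` (any callable prompt -> str)
    can retrieve it from that position."""
    pos = int(len(filler_tokens) * depth)
    doc = filler_tokens[:pos] + [needle] + filler_tokens[pos:]
    prompt = " ".join(doc) + "\nRepeat the code word hidden in the text above."
    return needle in model(prompt)


# Dummy "model" that succeeds whenever the needle survived into the
# prompt; replace with a real API client to test an actual provider.
def dummy(prompt):
    return "the code word is XYZZY-42" if "XYZZY-42" in prompt else "not found"


# Sweep depths: a real long-context model may fail at deep positions
# even when the claimed window covers the whole prompt.
results = {d: needle_trial(dummy, ["lorem"] * 2_000, "XYZZY-42", d)
           for d in (0.1, 0.5, 0.9)}
```

Running the sweep at several prompt lengths approximates a per-model effective-context curve for your own corpus, rather than relying on published benchmark averages.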
Why are output caps so much smaller than input windows?
Output tokens are generated autoregressively (one at a time, each conditioned on all previous), so output cost and latency scale linearly with output length. Input tokens process in parallel during prefill, which is dramatically cheaper. Vendors size output caps to keep tail latency and per-request cost manageable. Long-context workloads are read-heavy by design.
Is the context window race over?
The advertised-context curve is flattening as the major flagships converge at 1M. Outliers (Grok 4.20 at 2M, Llama 4 Scout at 10M) are spec-sheet wins, not operational changes. The active research frontier has moved to effective-context retention (RULER scores) and to cache pricing economics, both of which determine how usable a long context actually is.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.