From 4K to 10M Tokens in 36 Months
Between early 2023 and May 2026, frontier LLM context windows expanded by roughly three orders of magnitude at the ceiling. The default flagship window moved from 4K-8K tokens to 1M-2M tokens (roughly a 250x jump), Llama 4 Scout pushed the public ceiling to 10M tokens, and Magic.dev demonstrated 100M in lab settings. This page tracks the full timeline and pairs claimed numbers against effective-context-window scores from NVIDIA's RULER benchmark, the most-cited measurement of whether a model can actually reason over its claimed window.
Context Window Timeline, Selected Frontier Launches
| Date | Model | Vendor | Context (tokens) |
|---|---|---|---|
| Nov 2022 | GPT-3.5 | OpenAI | 4K (later 16K) |
| Mar 2023 | GPT-4 | OpenAI | 8K / 32K |
| Mar 2023 | Claude 1 | Anthropic | 9K |
| Jul 2023 | Claude 2 | Anthropic | 100K |
| Nov 2023 | GPT-4 Turbo | OpenAI | 128K |
| Nov 2023 | Claude 2.1 | Anthropic | 200K |
| Feb 2024 | Gemini 1.5 Pro | Google | 1M |
| Apr 2024 | Gemini 1.5 Pro (expanded) | Google | 2M |
| Apr 2024 | Llama 3 | Meta | 8K |
| May 2024 | GPT-4o | OpenAI | 128K |
| Jul 2024 | Llama 3.1 | Meta | 128K |
| Mar 2025 | Gemini 2.5 Pro | Google | 1M |
| Apr 2025 | Llama 4 Scout | Meta | 10M |
| May 2025 | Claude Sonnet 4 | Anthropic | 1M (beta) |
| Feb 2026 | Gemini 3.1 Pro | Google | 1M |
| Mar 2026 | Claude Opus 4.6 / Sonnet 4.6 GA | Anthropic | 1M (flat rate) |
| Apr 2026 | DeepSeek-V4 | DeepSeek | 1M |
| 2026 | grok-4.20 | xAI | 2M |
Current State (May 2026), by Vendor
| Model | Vendor | Claimed Context | Output Cap | Pricing Note |
|---|---|---|---|---|
| Llama 4 Scout | Meta | 10M | ~32K | Hosted by Groq, Together, etc. |
| grok-4.20 | xAI | 2M | ~32K | Flat input rate |
| Gemini 2.5 Pro | Google | 1M | ~65K | 2x surcharge above 200K input |
| Gemini 3.1 Pro | Google | 1M | 65K | 2x surcharge above 200K input |
| Claude Opus 4.7 | Anthropic | 200K | ~32K | 1M tier requires preview access |
| Claude Sonnet 4.6 | Anthropic | 1M | ~64K | Flat rate, no surcharge |
| Claude Haiku 4.5 | Anthropic | 200K | ~32K | Flat rate |
| DeepSeek-V4-Flash / Pro | DeepSeek | 1M | 384K | Cache-hit discount up to 99 percent |
| GPT-5.5 | OpenAI | ~270K | ~128K | Long-context rates apply above 270K |
| Mistral Large 3 / Medium 3.5 | Mistral | 128K | ~32K | Flat rate |
| Cohere Command R+ | Cohere | 128K | ~4K | Flat rate |
Claimed vs Effective: RULER Benchmark Findings
The headline number is not the operating number. NVIDIA's RULER benchmark measures actual retrieval, multi-hop tracing, and aggregation across the full claimed window. Findings published through 2026 indicate that effective context (the length over which a model retains its short-context accuracy) is typically 50-65 percent of the advertised number.
| Model | RULER Score at 4K | RULER Score at 128K | Drop |
|---|---|---|---|
| Gemini 1.5 Pro | ~96 | ~94 | ~2 points (best-in-class retention) |
| GPT-4-1106 | 96.6 | 81.2 | ~15 points |
| Llama 3.1-70B | 96.5 | 66.6 | ~30 points |
Pattern: Google's Gemini family retains long-context performance dramatically better than non-Google models; the 1M-token claim is closer to literally true for Gemini than for any other vendor. Llama 3.1-70B at 128K operates at roughly 70 percent of its 4K baseline accuracy, meaning material deep in a 128K prompt is largely lost.
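To make the "effective context" idea concrete, here is a minimal Python sketch that derives an effective-context estimate from per-length RULER scores. The 85-percent retention threshold is a simplification (RULER itself anchors the cutoff to a reference-model baseline), and the 32K scores below are hypothetical placeholders; only the 4K and 128K figures come from the table above.

```python
# Sketch: estimate an "effective context window" from per-length RULER scores.
# Assumption: effective context = longest tested length at which the model
# keeps at least 85% of its own 4K score. RULER anchors its cutoff to a
# reference-model baseline; this fraction-of-self rule is an illustration.

RULER_SCORES = {
    # model: {context_length: score}. 4K and 128K from the table above;
    # the 32K entries are hypothetical placeholders.
    "gemini-1.5-pro": {4_000: 96.0, 32_000: 95.5, 128_000: 94.0},
    "gpt-4-1106":     {4_000: 96.6, 32_000: 93.0, 128_000: 81.2},
    "llama-3.1-70b":  {4_000: 96.5, 32_000: 91.0, 128_000: 66.6},
}

def effective_context(scores: dict[int, float], retention: float = 0.85) -> int:
    """Longest tested length whose score is >= retention * the 4K baseline."""
    baseline = scores[min(scores)]
    passing = [length for length, score in sorted(scores.items())
               if score >= retention * baseline]
    return max(passing)

for model, scores in RULER_SCORES.items():
    print(f"{model}: effective context ~{effective_context(scores):,} tokens")
```

Under these placeholder numbers, Gemini 1.5 Pro passes at 128K while GPT-4-1106 and Llama 3.1-70B fall back to an effective ~32K, which is the pattern the table shows.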
Five Things the Race Tells You
- The race is over for advertised capacity. 1M tokens is now table stakes for Google, Anthropic, xAI, and DeepSeek. The 2-10M outliers (Grok 4.20, Llama 4 Scout) are spec-sheet wins, not workload reality.
- Effective context lags claimed context by roughly 35-50 percent. RULER and similar benchmarks show meaningful degradation past 32K-64K tokens for most non-Gemini models. Plan for the effective number, not the marketing number.
- Pricing discontinuities cluster at 200K input. Google roughly doubles the input rate above 200K on Gemini 2.5 Pro and Gemini 3.1 Pro, and OpenAI applies long-context rates above ~270K. Anthropic Sonnet 4.6 is the only flagship with flat 1M pricing, which removes a cost-planning step.
- Output windows lag input windows by 10-30x. A 1M input window paired with a 32K output cap means the model can read a novel but only write a chapter. Long-context workloads are read-heavy by design.
- Cache-hit pricing changes the economics more than context length does. DeepSeek-V4-Flash cached input is $0.0028 per million tokens, roughly 50x cheaper than uncached. For RAG and document-QA workloads on stable corpora, cache pricing matters more than raw context window; the sketch after this list compares the three pricing shapes.
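Below is a rough Python sketch comparing per-call input cost under the three pricing shapes named above: whole-prompt tier surcharge (Gemini-style), flat rate (Sonnet 4.6-style), and cache-hit discounting (DeepSeek-style). Every dollar rate except the $0.0028 cached figure quoted above is an illustrative assumption, not a published price.

```python
# Sketch: per-call input cost under three pricing shapes. All $/1M-token
# rates are illustrative assumptions except the DeepSeek cached rate quoted
# above; check vendor pricing pages for live numbers.

def tiered_cost(tokens: int, base: float, mult: float = 2.0,
                threshold: int = 200_000) -> float:
    """Gemini-style: the whole prompt bills at the higher tier once input
    crosses the threshold (some vendors bill only the marginal tokens)."""
    rate = base * mult if tokens > threshold else base
    return tokens * rate / 1_000_000

def flat_cost(tokens: int, rate: float) -> float:
    """Sonnet-4.6-style: one input rate regardless of prompt length."""
    return tokens * rate / 1_000_000

def cached_cost(tokens: int, miss_rate: float, hit_fraction: float,
                hit_rate: float = 0.0028) -> float:
    """DeepSeek-style: cache hits bill at the deep-discount rate."""
    hits = tokens * hit_fraction
    return (hits * hit_rate + (tokens - hits) * miss_rate) / 1_000_000

prompt = 800_000  # tokens of retrieved context per call
print(f"tiered : ${tiered_cost(prompt, base=1.25):.4f}")  # assumed $1.25/1M base
print(f"flat   : ${flat_cost(prompt, rate=3.00):.4f}")    # assumed $3.00/1M
# miss_rate 0.14 is ~50x the cached rate, matching the bullet above
print(f"cached : ${cached_cost(prompt, miss_rate=0.14, hit_fraction=0.9):.4f}")
```

At an 800K prompt with a 90 percent cache-hit rate, the cached path comes out two orders of magnitude cheaper than either uncached shape, which is the economics point in the last bullet.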
What This Means for AI Visibility and Brand-Recommendation Workloads
Long-context models are increasingly the substrate for agentic brand recommendations: a retrieval system loads vendor documentation, customer reviews, and competitive material into the prompt, then the model picks. Two things matter for brand visibility in this pipeline. First, the effective-context-window number determines whether a brand's information at token 800K of a 1M prompt actually influences the recommendation, or gets ignored due to long-context decay. Second, models with strong RULER retention (Gemini family today) preserve mid-context information that weaker long-context models drop, so a brand's placement in the retrieval order matters less. Brands optimising AI visibility should monitor both the headline context and the published effective-context measurements when prioritising platform coverage.
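One way to test the first point directly is a depth probe: bury a brand fact at increasing depths in a long prompt and check whether it still surfaces in the answer. The sketch below assumes an OpenAI-compatible chat endpoint; the model name, filler corpus, and brand fact are hypothetical placeholders, and this is a needle-in-a-haystack-style probe rather than the full RULER harness.

```python
# Sketch: probe whether brand information survives at depth in a long prompt.
# Assumes an OpenAI-compatible chat endpoint; model name, filler text, and
# the brand fact are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()
BRAND_FACT = "AcmeDB supports zero-downtime schema migrations."  # hypothetical
FILLER = "Lorem ipsum market analysis. " * 2_000  # ~10K tokens of padding per unit

def probe(depth_units: int, total_units: int = 10, model: str = "gpt-5.5") -> bool:
    """Bury BRAND_FACT after `depth_units` of filler, then ask a question
    whose correct answer requires it. Returns True if the fact surfaced."""
    prompt = FILLER * depth_units + BRAND_FACT + FILLER * (total_units - depth_units)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt +
                   "\n\nWhich database supports zero-downtime schema migrations?"}],
    )
    return "AcmeDB" in resp.choices[0].message.content

# Sweep insertion depth: a model with a weak effective context window will
# start failing as the fact moves deeper into the prompt.
for depth in range(0, 11, 2):
    print(f"depth {depth}/10 -> recalled: {probe(depth)}")
```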
Methodology
Timeline data assembled from vendor model cards and announcement posts. Current-state context windows pulled from vendor pricing and developer documentation on May 14, 2026: OpenAI, Anthropic, Google, DeepSeek, xAI, Mistral. RULER scores from the NVIDIA RULER repo (github.com/NVIDIA/RULER) and the originating paper (arxiv:2404.06654). Output caps reflect documented maximum-output-tokens values; some vendors throttle below the documented ceiling under load. Refreshed quarterly.
How Presenc AI Helps
Presenc AI tracks how brand mentions surface across short-context flagship calls (the consumer-visible answer) and long-context retrieval pipelines (the agent doing the comparison). The same brand often ranks differently in a 4K direct prompt than in an 800K retrieval prompt, and the gap is where most AI-visibility programmes lose attribution. For brands building a multi-tier visibility strategy, this is the signal that connects context-window mechanics to recommendation outcomes.