LLM API Pricing Comparison May 2026

Per-million-token pricing for frontier LLM APIs in May 2026. OpenAI GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, DeepSeek V4, Grok 4.3, Mistral, Cohere, and Groq compared side-by-side with context windows and caching discounts.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Frontier LLM API Pricing at a Glance, May 2026

This page consolidates published per-million-token API pricing for the major frontier LLMs in May 2026. Numbers below are pulled from each vendor's own pricing page or developer documentation and represent standard pay-as-you-go rates, not committed-use, batch, or enterprise pricing. Caching, batch, and long-context surcharges are noted where the vendor publishes them.

Frontier Tier Comparison (Standard Pricing, USD per 1M Tokens)

| Vendor | Model | Input | Output | Context |
|---|---|---|---|---|
| OpenAI | GPT-5.5 Pro | $30.00 | $180.00 | ~270K |
| OpenAI | GPT-5.5 | $5.00 | $30.00 | ~270K |
| OpenAI | GPT-5.4 | $2.50 | $15.00 | ~270K |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 | 200K |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M (flat) |
| Google | Gemini 3.1 Pro Preview | $2.00 / $4.00* | $12.00 / $18.00* | 1M |
| Google | Gemini 2.5 Pro | $1.25 / $2.50* | $10.00 / $15.00* | 1M |
| xAI | grok-4.3 | $1.25 | $2.50 | 1M |
| xAI | grok-4.20 (reasoning) | $1.25 | $2.50 | 2M |
| Mistral | Mistral Medium 3.5 | $1.50 | $7.50 | 128K |
| Cohere | Command R+ (08-2024) | $2.50 | $10.00 | 128K |

*Google tiered pricing: lower number for prompts up to 200K input tokens, higher for prompts above 200K.
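Google's tiered scheme folds into a small cost helper. This is a sketch assuming, as the footnote implies, that the higher rate applies to the entire request once the prompt exceeds the 200K input-token threshold; the default rates are Gemini 2.5 Pro's from the table above.

```python
def gemini_tiered_cost(input_tokens: int, output_tokens: int,
                       in_lo: float = 1.25, in_hi: float = 2.50,
                       out_lo: float = 10.00, out_hi: float = 15.00,
                       threshold: int = 200_000) -> float:
    """USD cost for one request under a tiered long-context scheme.

    Assumes the whole request is billed at the higher rate once the
    prompt crosses the input-token threshold (per the footnote above).
    """
    per_m = 1_000_000
    if input_tokens <= threshold:
        return (input_tokens * in_lo + output_tokens * out_lo) / per_m
    return (input_tokens * in_hi + output_tokens * out_hi) / per_m

# 150K-token prompt, 2K-token answer: (150000*1.25 + 2000*10.00) / 1e6 = $0.2075
# 300K-token prompt, 2K-token answer: (300000*2.50 + 2000*15.00) / 1e6 = $0.78
```

A RAG pipeline that can trim prompts to stay under the threshold avoids paying the higher rate on every input token, not just the marginal ones.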

Cost-Leader Tier Comparison (Standard Pricing, USD per 1M Tokens)

| Vendor | Model | Input | Output | Context |
|---|---|---|---|---|
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| OpenAI | GPT-5.4 Mini | $0.75 | $4.50 | ~270K |
| Google | Gemini 3 Flash Preview | $0.50 | $3.00 | 1M |
| Mistral | Mistral Large 3 | $0.50 | $1.50 | 128K |
| Mistral | Mistral Medium 3 | $0.40 | $2.00 | 128K |
| Google | Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 1M |
| OpenAI | GPT-5.4 Nano | $0.20 | $1.25 | ~270K |
| xAI | grok-4-1-fast | $0.20 | $0.50 | 2M |
| DeepSeek | DeepSeek-V4-Flash | $0.14 | $0.28 | 1M |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | ~270K |

Inference-Speed Specialists (Groq, OSS Models)

Groq hosts open-weight models on custom LPU hardware and prices on tokens, not seconds. The headline figure for many workloads is throughput (tokens per second) rather than dollars per token.

| Model | Input | Output | Throughput |
|---|---|---|---|
| Llama 4 Scout (17B x 16E) | $0.11 | $0.34 | 594 tps |
| Llama 3.3 70B Versatile | $0.59 | $0.79 | 394 tps |
| Llama 3.1 8B Instant | $0.05 | $0.08 | 840 tps |
| GPT-OSS 120B (128K) | $0.15 | $0.60 | 500 tps |
| Qwen3 32B (131K) | $0.29 | $0.59 | 662 tps |
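When throughput is the headline figure, dollars and wall-clock time both fall out of the table. A sketch using the published per-1M prices, with the simplifying assumption that the tps column is sequential decode throughput for a single stream (real batch jobs run many streams in parallel):

```python
# Prices ($/1M input, $/1M output) and decode throughput (tps) from the
# Groq table above.
MODELS = {
    "llama-3.1-8b-instant": (0.05, 0.08, 840),
    "llama-3.3-70b-versatile": (0.59, 0.79, 394),
    "gpt-oss-120b": (0.15, 0.60, 500),
}

def job_estimate(model: str, requests: int,
                 in_tok: int, out_tok: int) -> tuple[float, float]:
    """(total USD cost, single-stream decode seconds) for a batch job."""
    in_price, out_price, tps = MODELS[model]
    cost = requests * (in_tok * in_price + out_tok * out_price) / 1_000_000
    decode_seconds = requests * out_tok / tps  # one sequential stream
    return cost, decode_seconds

# 10,000 requests of 1K in / 200 out on Llama 3.1 8B Instant:
# cost = 10000 * (1000*0.05 + 200*0.08) / 1e6 = $0.66
```

The model keys are informal labels for this sketch, not necessarily Groq's API identifiers.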

Five Things the Table Tells You

  1. Frontier-to-frugal spans roughly 450x on output tokens. GPT-5.5 Pro output is $180 per million; Gemini 2.5 Flash-Lite output is $0.40, a 450x spread. The same workload routed to the right model can be close to three orders of magnitude cheaper.
  2. Output costs 4-6x input across most vendors. The output multiplier is the dominant cost lever. Workloads that read a lot and write a little (summarization, classification, extraction) are dramatically cheaper per request than workloads that write a lot (long-form generation, agentic loops).
  3. Sonnet 4.6 is the only flagship with a flat 1M-token context. Google charges roughly 2x above 200K input on Gemini 2.5 Pro and Gemini 3.1 Pro; OpenAI's long-context rates apply above ~270K. Anthropic's flat-rate 1M tier removes that surcharge planning step.
  4. DeepSeek-V4-Flash redefined the cost floor. At $0.14 input / $0.28 output with a 1M context, V4-Flash is roughly 35x cheaper than GPT-5.5 on input ($0.14 vs $5.00). Cache-hit pricing on V4-Flash falls to $0.0028 per million input tokens, which is functionally free for read-heavy retrieval workloads.
  5. Caching and batch discounts are the second cost lever after model choice. Anthropic batch processing halves token costs and prompt caching reduces cached input by 90 percent. DeepSeek-V4-Pro's temporary 75 percent discount (effective through May 31, 2026) and Google's context caching offer comparable structural savings. Workloads with stable system prompts or retrieval contexts see the largest gains.
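The spread in point 1 is easy to make concrete. A minimal per-request cost helper, with rates copied from the tables above (the model keys are informal labels for this sketch, not API identifiers):

```python
# ($/1M input, $/1M output) from the May 2026 tables above.
PRICES = {
    "gpt-5.5-pro": (30.00, 180.00),
    "claude-opus-4.7": (5.00, 25.00),
    "deepseek-v4-flash": (0.14, 0.28),
    "gemini-2.5-flash-lite": (0.10, 0.40),
}

def request_cost(model: str, in_tok: int, out_tok: int) -> float:
    """USD cost of one request at standard pay-as-you-go rates."""
    in_price, out_price = PRICES[model]
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Same workload, 5K-token prompt and 1K-token answer:
# gpt-5.5-pro:           (5000*30.00 + 1000*180.00) / 1e6 = $0.33
# gemini-2.5-flash-lite: (5000*0.10  + 1000*0.40)   / 1e6 = $0.0009
```

For this request shape the flagship-to-floor ratio is about 367x; heavier output skews it further, since output carries the larger multiplier.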

What This Means for AI Visibility and Brand-Recommendation Workloads

The model that a downstream agent uses to evaluate brand options has a direct effect on which brands surface. Cheap models (Flash-Lite, V4-Flash, Nano tiers) handle bulk routing and lightweight extraction; flagship models (Opus 4.7, GPT-5.5, Gemini 3.1 Pro) handle reasoning-heavy comparison and final recommendations. Brands optimizing for AI visibility should test across both tiers, because cheap models increasingly run upstream of the flagship that produces the consumer-visible answer.

Methodology

Pricing data collected May 14, 2026 from vendor sources: OpenAI API pricing, Anthropic Claude API pricing, Gemini API pricing, Mistral pricing, DeepSeek pricing, xAI Grok models, Cohere pricing, and Groq pricing. All figures are USD per million tokens at standard pay-as-you-go rates. Long-context surcharges and tiered rates are flagged where the vendor publishes them. Pricing is refreshed quarterly. Treat figures as accurate at time of capture; verify against the vendor source before committing to enterprise spend.

How Presenc AI Helps

Presenc AI monitors brand-recommendation outputs across the major AI platforms, including the flagship and cost-leader tiers in the tables above. When a brand's visibility shifts because an upstream model swap routes more traffic through Gemini 2.5 Flash or DeepSeek-V4-Flash rather than the flagship, that change shows up in our platform-by-platform tracking. For brands building AI visibility strategy, the relevant question is not just which model is cheapest, but which models the recommendation pipeline actually uses end-to-end.

Frequently Asked Questions

Which LLM API is cheapest in May 2026?

Among cost-leader frontier-quality models, DeepSeek-V4-Flash at $0.14 input / $0.28 output per million tokens is the floor with a 1M-token context. Among major US vendors, Gemini 2.5 Flash-Lite ($0.10 input / $0.40 output) and GPT-4.1 Nano (same) lead the cost-leader tier. The cheapest reasoning-capable model from a major US vendor is GPT-5.4 Mini at $0.75 / $4.50.
Why do output tokens cost more than input tokens?

Output tokens require a forward pass per token (autoregressive decoding), while input tokens are processed in parallel during prefill. Decoding is the bottleneck on inference hardware. The 4-6x input-to-output multiplier across vendors reflects this asymmetry plus the higher value of generation versus parsing.
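A toy latency model makes the asymmetry concrete. The throughput numbers below are illustrative assumptions, not vendor benchmarks:

```python
# Assumed throughputs for the toy model: prefill processes input tokens
# in parallel, decode generates output tokens one at a time.
PREFILL_TPS = 10_000  # input tokens/second (assumption)
DECODE_TPS = 100      # output tokens/second (assumption)

def request_seconds(input_tokens: int, output_tokens: int) -> float:
    """Hardware seconds for one request under the toy model."""
    return input_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# A request that reads 10,000 tokens and writes 500 spends
# 1.0 s in prefill but 5.0 s in decode: it reads 20x more than it
# writes, yet decoding dominates the hardware bill.
```

Under these assumptions the per-token hardware cost of output is orders of magnitude above input, which is why vendors price the two streams separately.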
Do vendors offer batch or caching discounts?

Most major vendors offer both. Anthropic batch processing is approximately 50 percent off and prompt caching reduces cached input by approximately 90 percent. OpenAI offers batch and flex pricing at approximately 50 percent off. Google offers context caching with similar savings. DeepSeek-V4-Flash cache-hit pricing falls to $0.0028 per million input tokens. Workloads with stable system prompts or large retrieved contexts benefit the most.
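Blended input pricing under caching is one line of arithmetic. A sketch assuming a flat cached-input rate (the base and cached prices below use the ~90 percent discount quoted above as a hypothetical example):

```python
def effective_input_price(base_price: float, cached_price: float,
                          hit_rate: float) -> float:
    """Blended $/1M input tokens given the fraction of tokens served
    from cache. hit_rate is between 0.0 and 1.0."""
    return hit_rate * cached_price + (1 - hit_rate) * base_price

# Hypothetical: $3.00/1M base, $0.30/1M cached (a 90% discount),
# 80% of input tokens hitting cache:
# 0.8 * 0.30 + 0.2 * 3.00 = $0.84/1M, a 72% effective discount.
```

The lever scales with hit rate, which is why stable system prompts and reusable retrieval contexts dominate the savings.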
How does long-context pricing work on Gemini?

Gemini 2.5 Pro and Gemini 3.1 Pro Preview charge roughly 2x more for prompts above 200K input tokens. Below 200K, Gemini 2.5 Pro is $1.25 input; above 200K it becomes $2.50. Plan request shapes accordingly, especially for RAG workloads that may approach the threshold. Anthropic Sonnet 4.6 is the only flagship with a flat 1M-token context rate.
Will these prices stay stable?

No. Frontier LLM pricing has dropped roughly 80-90 percent on capability-equivalent models since GPT-4's launch in March 2023. Expect quarterly downward revisions on cost-leader tiers, occasional new flagship launches at high prices, and irregular temporary discounts (DeepSeek-V4-Pro has a 75 percent discount through May 31, 2026 in this snapshot). Refresh pricing before any multi-quarter cost projection.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.