LLM API Pricing Comparison May 2026

Per-million-token pricing for frontier LLM APIs in May 2026. OpenAI GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, DeepSeek V4, Grok 4.3, Mistral, Cohere, and Groq compared side-by-side with context windows and caching discounts.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Frontier LLM API Pricing at a Glance, May 2026

This page consolidates published per-million-token API pricing for the major frontier LLMs in May 2026. Numbers below are pulled from each vendor's own pricing page or developer documentation and represent standard pay-as-you-go rates, not committed-use, batch, or enterprise pricing. Caching, batch, and long-context surcharges are noted where the vendor publishes them.

Frontier Tier Comparison (Standard Pricing, USD per 1M Tokens)

| Vendor | Model | Input | Output | Context |
|---|---|---|---|---|
| OpenAI | GPT-5.5 Pro | $30.00 | $180.00 | ~270K |
| OpenAI | GPT-5.5 | $5.00 | $30.00 | ~270K |
| OpenAI | GPT-5.4 | $2.50 | $15.00 | ~270K |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 | 200K |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M (flat) |
| Google | Gemini 3.1 Pro Preview | $2.00 / $4.00* | $12.00 / $18.00* | 1M |
| Google | Gemini 2.5 Pro | $1.25 / $2.50* | $10.00 / $15.00* | 1M |
| xAI | grok-4.3 | $1.25 | $2.50 | 1M |
| xAI | grok-4.20 (reasoning) | $1.25 | $2.50 | 2M |
| Mistral | Mistral Medium 3.5 | $1.50 | $7.50 | 128K |
| Cohere | Command R+ (08-2024) | $2.50 | $10.00 | 128K |

*Google tiered pricing: lower number for prompts up to 200K input tokens, higher for prompts above 200K.
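Google's tiered scheme folds into a small cost helper. This is a sketch assuming, as the footnote implies, that the higher rate applies to the entire request once the prompt exceeds the 200K input-token threshold; the default rates are Gemini 2.5 Pro's from the table above.

```python
def gemini_tiered_cost(input_tokens: int, output_tokens: int,
                       in_lo: float = 1.25, in_hi: float = 2.50,
                       out_lo: float = 10.00, out_hi: float = 15.00,
                       threshold: int = 200_000) -> float:
    """USD cost for one request under a tiered long-context scheme.

    Assumes the whole request is billed at the higher rate once the
    prompt crosses the input-token threshold (per the footnote above).
    """
    per_m = 1_000_000
    if input_tokens <= threshold:
        return (input_tokens * in_lo + output_tokens * out_lo) / per_m
    return (input_tokens * in_hi + output_tokens * out_hi) / per_m

# 150K-token prompt, 2K-token answer: (150000*1.25 + 2000*10.00) / 1e6 = $0.2075
# 300K-token prompt, 2K-token answer: (300000*2.50 + 2000*15.00) / 1e6 = $0.78
```

A RAG pipeline that can trim prompts to stay under the threshold avoids paying the higher rate on every input token, not just the marginal ones.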

Cost-Leader Tier Comparison (Standard Pricing, USD per 1M Tokens)

| Vendor | Model | Input | Output | Context |
|---|---|---|---|---|
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| OpenAI | GPT-5.4 Mini | $0.75 | $4.50 | ~270K |
| Google | Gemini 3 Flash Preview | $0.50 | $3.00 | 1M |
| Mistral | Mistral Large 3 | $0.50 | $1.50 | 128K |
| Mistral | Mistral Medium 3 | $0.40 | $2.00 | 128K |
| Google | Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 1M |
| OpenAI | GPT-5.4 Nano | $0.20 | $1.25 | ~270K |
| xAI | grok-4-1-fast | $0.20 | $0.50 | 2M |
| DeepSeek | DeepSeek-V4-Flash | $0.14 | $0.28 | 1M |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | ~270K |

Inference-Speed Specialists (Groq, OSS Models)

Groq hosts open-weight models on custom LPU hardware and prices on tokens, not seconds. The headline figure for many workloads is throughput (tokens per second) rather than dollars per token.

| Model | Input | Output | Throughput |
|---|---|---|---|
| Llama 4 Scout (17B x 16E) | $0.11 | $0.34 | 594 tps |
| Llama 3.3 70B Versatile | $0.59 | $0.79 | 394 tps |
| Llama 3.1 8B Instant | $0.05 | $0.08 | 840 tps |
| GPT-OSS 120B (128K) | $0.15 | $0.60 | 500 tps |
| Qwen3 32B (131K) | $0.29 | $0.59 | 662 tps |
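When throughput is the headline figure, dollars and wall-clock time both fall out of the table. A sketch using the published per-1M prices, with the simplifying assumption that the tps column is sequential decode throughput for a single stream (real batch jobs run many streams in parallel):

```python
# Prices ($/1M input, $/1M output) and decode throughput (tps) from the
# Groq table above.
MODELS = {
    "llama-3.1-8b-instant": (0.05, 0.08, 840),
    "llama-3.3-70b-versatile": (0.59, 0.79, 394),
    "gpt-oss-120b": (0.15, 0.60, 500),
}

def job_estimate(model: str, requests: int,
                 in_tok: int, out_tok: int) -> tuple[float, float]:
    """(total USD cost, single-stream decode seconds) for a batch job."""
    in_price, out_price, tps = MODELS[model]
    cost = requests * (in_tok * in_price + out_tok * out_price) / 1_000_000
    decode_seconds = requests * out_tok / tps  # one sequential stream
    return cost, decode_seconds

# 10,000 requests of 1K in / 200 out on Llama 3.1 8B Instant:
# cost = 10000 * (1000*0.05 + 200*0.08) / 1e6 = $0.66
```

The model keys are informal labels for this sketch, not necessarily Groq's API identifiers.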

Five Things the Table Tells You

  1. Frontier-to-frugal spans roughly 450x on output tokens. GPT-5.5 Pro output is $180 per million; Gemini 2.5 Flash-Lite output is $0.40, a 450x spread. The same workload routed to the right model can be close to three orders of magnitude cheaper.
  2. Output costs 4-6x input across most vendors. The output multiplier is the dominant cost lever. Workloads that read a lot and write a little (summarization, classification, extraction) are dramatically cheaper per request than workloads that write a lot (long-form generation, agentic loops).
  3. Sonnet 4.6 is the only flagship with a flat 1M-token context. Google charges roughly 2x above 200K input on Gemini 2.5 Pro and Gemini 3.1 Pro; OpenAI's long-context rates apply above ~270K. Anthropic's flat-rate 1M tier removes that surcharge planning step.
  4. DeepSeek-V4-Flash redefined the cost floor. At $0.14 input / $0.28 output with a 1M context, V4-Flash is roughly 35x cheaper than GPT-5.5 on input ($0.14 vs $5.00). Cache-hit pricing on V4-Flash falls to $0.0028 per million input tokens, which is functionally free for read-heavy retrieval workloads.
  5. Caching and batch discounts are the second cost lever after model choice. Anthropic batch processing halves token costs and prompt caching reduces cached input by 90 percent. DeepSeek-V4-Pro's temporary 75 percent discount (effective through May 31, 2026) and Google's context caching offer comparable structural savings. Workloads with stable system prompts or retrieval contexts see the largest gains.
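The spread in point 1 is easy to make concrete. A minimal per-request cost helper, with rates copied from the tables above (the model keys are informal labels for this sketch, not API identifiers):

```python
# ($/1M input, $/1M output) from the May 2026 tables above.
PRICES = {
    "gpt-5.5-pro": (30.00, 180.00),
    "claude-opus-4.7": (5.00, 25.00),
    "deepseek-v4-flash": (0.14, 0.28),
    "gemini-2.5-flash-lite": (0.10, 0.40),
}

def request_cost(model: str, in_tok: int, out_tok: int) -> float:
    """USD cost of one request at standard pay-as-you-go rates."""
    in_price, out_price = PRICES[model]
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Same workload, 5K-token prompt and 1K-token answer:
# gpt-5.5-pro:           (5000*30.00 + 1000*180.00) / 1e6 = $0.33
# gemini-2.5-flash-lite: (5000*0.10  + 1000*0.40)   / 1e6 = $0.0009
```

For this request shape the flagship-to-floor ratio is about 367x; heavier output skews it further, since output carries the larger multiplier.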

What This Means for AI Visibility and Brand-Recommendation Workloads

The model that a downstream agent uses to evaluate brand options has a direct effect on which brands surface. Cheap models (Flash-Lite, V4-Flash, Nano tiers) handle bulk routing and lightweight extraction; flagship models (Opus 4.7, GPT-5.5, Gemini 3.1 Pro) handle reasoning-heavy comparison and final recommendations. Brands optimizing for AI visibility should test across both tiers, because cheap models increasingly run upstream of the flagship that produces the consumer-visible answer.

Methodology

Pricing data collected May 14, 2026 from vendor sources: OpenAI API pricing, Anthropic Claude API pricing, Gemini API pricing, Mistral pricing, DeepSeek pricing, xAI Grok models, Cohere pricing, and Groq pricing. All figures are USD per million tokens at standard pay-as-you-go rates. Long-context surcharges and tiered rates are flagged where the vendor publishes them. Pricing is refreshed quarterly. Treat figures as accurate at time of capture; verify against the vendor source before committing to enterprise spend.

How Presenc AI Helps

Presenc AI monitors brand-recommendation outputs across the major AI platforms, including the flagship and cost-leader tiers in the tables above. When a brand's visibility shifts because an upstream model swap routes more traffic through Gemini 2.5 Flash or DeepSeek-V4-Flash rather than the flagship, that change shows up in our platform-by-platform tracking. For brands building AI visibility strategy, the relevant question is not just which model is cheapest, but which models the recommendation pipeline actually uses end-to-end.

Frequently Asked Questions

Which LLM API is cheapest in May 2026?

Among cost-leader frontier-quality models, DeepSeek-V4-Flash at $0.14 input / $0.28 output per million tokens is the floor with a 1M-token context. Among major US vendors, Gemini 2.5 Flash-Lite ($0.10 input / $0.40 output) and GPT-4.1 Nano (same) lead the cost-leader tier. The cheapest reasoning-capable model from a major US vendor is GPT-5.4 Mini at $0.75 / $4.50.
Why do output tokens cost more than input tokens?

Output tokens require a forward pass per token (autoregressive decoding), while input tokens are processed in parallel during prefill. Decoding is the bottleneck on inference hardware. The 4-6x input-to-output multiplier across vendors reflects this asymmetry plus the higher value of generation versus parsing.
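A toy latency model makes the asymmetry concrete. The throughput numbers below are illustrative assumptions, not vendor benchmarks:

```python
# Assumed throughputs for the toy model: prefill processes input tokens
# in parallel, decode generates output tokens one at a time.
PREFILL_TPS = 10_000  # input tokens/second (assumption)
DECODE_TPS = 100      # output tokens/second (assumption)

def request_seconds(input_tokens: int, output_tokens: int) -> float:
    """Hardware seconds for one request under the toy model."""
    return input_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# A request that reads 10,000 tokens and writes 500 spends
# 1.0 s in prefill but 5.0 s in decode: it reads 20x more than it
# writes, yet decoding dominates the hardware bill.
```

Under these assumptions the per-token hardware cost of output is orders of magnitude above input, which is why vendors price the two streams separately.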
Do vendors offer batch or caching discounts?

Most major vendors offer both. Anthropic batch processing is approximately 50 percent off and prompt caching reduces cached input by approximately 90 percent. OpenAI offers batch and flex pricing at approximately 50 percent off. Google offers context caching with similar savings. DeepSeek-V4-Flash cache-hit pricing falls to $0.0028 per million input tokens. Workloads with stable system prompts or large retrieved contexts benefit the most.
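Blended input pricing under caching is one line of arithmetic. A sketch assuming a flat cached-input rate (the base and cached prices below use the ~90 percent discount quoted above as a hypothetical example):

```python
def effective_input_price(base_price: float, cached_price: float,
                          hit_rate: float) -> float:
    """Blended $/1M input tokens given the fraction of tokens served
    from cache. hit_rate is between 0.0 and 1.0."""
    return hit_rate * cached_price + (1 - hit_rate) * base_price

# Hypothetical: $3.00/1M base, $0.30/1M cached (a 90% discount),
# 80% of input tokens hitting cache:
# 0.8 * 0.30 + 0.2 * 3.00 = $0.84/1M, a 72% effective discount.
```

The lever scales with hit rate, which is why stable system prompts and reusable retrieval contexts dominate the savings.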
How does long-context pricing work on Gemini?

Gemini 2.5 Pro and Gemini 3.1 Pro Preview charge roughly 2x more for prompts above 200K input tokens. Below 200K, Gemini 2.5 Pro is $1.25 input; above 200K it becomes $2.50. Plan request shapes accordingly, especially for RAG workloads that may approach the threshold. Anthropic Sonnet 4.6 is the only flagship with a flat 1M-token context rate.
Will these prices stay stable?

No. Frontier LLM pricing has dropped roughly 80-90 percent on capability-equivalent models since GPT-4's launch in March 2023. Expect quarterly downward revisions on cost-leader tiers, occasional new flagship launches at high prices, and irregular temporary discounts (DeepSeek-V4-Pro has a 75 percent discount through May 31, 2026 in this snapshot). Refresh pricing before any multi-quarter cost projection.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.