LMSYS Chatbot Arena Elo Rankings May 2026

Live LMSYS / LM Arena Chatbot Arena Elo leaderboard for May 2026. Top 25 models from Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Alibaba, Baidu, and others with confidence intervals and vote counts.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

What the Crowd Actually Prefers in May 2026

The LMSYS Chatbot Arena (now hosted as arena.ai, formerly chat.lmsys.org) is the most-cited blind human preference benchmark in AI. Users compare two anonymous model responses side by side and pick a winner; the Bradley-Terry Elo system aggregates roughly 6 million pairwise votes accumulated since launch in May 2023. This page captures the top 25 of the Text leaderboard as of May 14, 2026, with vote counts and confidence intervals.
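The Bradley-Terry aggregation works by giving each model a strength parameter, fitting the strengths so that predicted head-to-head win rates match the observed votes, and then mapping the result onto the Elo scale. A minimal sketch of the standard minorization-maximization fit, on made-up vote counts (the model names and numbers below are illustrative, not Arena data):

```python
import math

# Illustrative pairwise win counts: wins[(a, b)] = times a beat b (NOT real Arena votes).
wins = {
    ("A", "B"): 60, ("B", "A"): 40,
    ("A", "C"): 70, ("C", "A"): 30,
    ("B", "C"): 55, ("C", "B"): 45,
}
models = ["A", "B", "C"]
strength = {m: 1.0 for m in models}  # Bradley-Terry strengths p_i

# MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
# where W_i is i's total wins and n_ij is the number of i-vs-j battles.
for _ in range(200):
    new = {}
    for i in models:
        w_i = sum(wins.get((i, j), 0) for j in models if j != i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = w_i / denom
    total = sum(new.values())
    strength = {m: v / total for m, v in new.items()}  # normalize (scale is arbitrary)

# Map to the Elo scale: 400 * log10(p_i), shifted so the mean lands near 1500.
raw = {m: 400 * math.log10(p) for m, p in strength.items()}
offset = 1500 - sum(raw.values()) / len(raw)
elo = {m: round(r + offset) for m, r in raw.items()}
print(elo)  # A ranks highest, C lowest, matching the raw win counts
```

The real leaderboard fits thousands of model pairs at once and derives confidence intervals by bootstrap resampling, but the core estimator is this same pairwise likelihood.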

Top 25 Text Leaderboard (May 14, 2026)

| Rank | Model | Org | Elo | ± | Votes |
|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | Anthropic | 1502 | 5 | 24,925 |
| 2 | claude-opus-4-7-thinking | Anthropic | 1501 | 6 | 10,413 |
| 3 | claude-opus-4-6 | Anthropic | 1498 | 4 | 26,459 |
| 4 | claude-opus-4-7 | Anthropic | 1492 | 6 | 11,006 |
| 5 | muse-spark | Meta | 1491 | 6 | Preliminary |
| 6 | gemini-3.1-pro-preview | Google | 1490 | 4 | 31,012 |
| 7 | gemini-3-pro | Google | 1486 | 4 | 41,339 |
| 8 | gpt-5.5-high | OpenAI | 1484 | 7 | 7,877 |
| 9 | grok-4.20-beta1 | xAI | 1479 | 5 | 20,258 |
| 10 | gpt-5.4-high | OpenAI | 1479 | 5 | 18,521 |
| 11 | gpt-5.2-chat-latest-20260210 | OpenAI | 1477 | 4 | 25,130 |
| 12 | grok-4.20-beta-0309-reasoning | xAI | 1477 | 5 | 18,895 |
| 13 | gpt-5.5 | OpenAI | 1476 | 7 | 7,982 |
| 14 | grok-4.20-multi-agent-beta-0309 | xAI | 1474 | 5 | 19,137 |
| 15 | gemini-3-flash | Google | 1474 | 4 | 30,753 |
| 16 | ernie-5.1 | Baidu | 1473 | 7 | 6,949 |
| 17 | claude-opus-4-5-20251101-thinking-32k | Anthropic | 1473 | 4 | 37,127 |
| 18 | gpt-5.5-instant | OpenAI | 1472 | 8 | 4,927 |
| 19 | glm-5.1 | Z.ai | 1471 | 6 | 11,485 |
| 20 | claude-opus-4-5-20251101 | Anthropic | 1468 | 3 | 56,217 |
| 21 | grok-4.1-thinking | xAI | 1467 | 3 | 56,685 |
| 22 | claude-sonnet-4-6 | Anthropic | 1467 | 5 | 18,529 |
| 23 | gpt-5.4 | OpenAI | 1467 | 5 | 19,364 |
| 24 | mimo-v2.5-pro | Xiaomi | 1465 | 7 | 7,476 |
| 25 | qwen3.5-max-preview | Alibaba | 1465 | 5 | 15,533 |

Vendor Share of the Top 25

| Vendor | Models in Top 25 | Top Rank |
|---|---|---|
| Anthropic | 7 | #1 (Opus 4.6 Thinking, 1502) |
| OpenAI | 6 | #8 (GPT-5.5-high, 1484) |
| xAI (Grok) | 4 | #9 (Grok-4.20-beta1, 1479) |
| Google | 3 | #6 (Gemini-3.1-Pro-Preview, 1490) |
| Chinese vendors (Baidu, Z.ai, Xiaomi, Alibaba) | 4 | #16 (ERNIE-5.1, 1473) |
| Meta | 1 | #5 (Muse Spark, 1491 preliminary) |

Six Things the Rankings Tell You

  1. Anthropic holds 4 of the top 5 slots. Claude Opus 4.6 and Opus 4.7 (both regular and thinking variants) dominate the top of the leaderboard, with only Meta's preliminary Muse Spark breaking up the cluster. As of this snapshot, Anthropic holds its strongest human-preference lead since Claude 3 Opus briefly topped the board in 2024.
  2. Thinking-mode adds roughly 4-9 Elo. Pairing Claude Opus 4.6 against 4.6-thinking (1498 vs 1502), Claude Opus 4.7 against 4.7-thinking (1492 vs 1501), and GPT-5.5 against GPT-5.5-high (1476 vs 1484) shows the pattern: gaps of 4, 9, and 8 Elo. The premium is small in absolute terms but consistent across vendors that publish both variants.
  3. OpenAI has the most models in the top 25 but the lowest top rank among the big three. Six OpenAI entries (GPT-5.5-high, 5.5, 5.5-instant, 5.4-high, 5.4, 5.2-chat-latest) span ranks 8-23, but no GPT model breaks into the top 5. The Anthropic-OpenAI gap is roughly 18 Elo at the top of each vendor's stack (1502 vs 1484).
  4. Chinese vendors hold four top-25 slots and are gaining ground. Baidu ERNIE 5.1, Z.ai GLM 5.1, Xiaomi MiMo V2.5 Pro, and Alibaba Qwen 3.5 Max Preview occupy ranks 16, 19, 24, and 25. All four are clustered in the 1465-1473 range with overlapping confidence intervals. The frontier-vs-Chinese gap is roughly 30 Elo points (1502 vs 1473).
  5. Meta's Muse Spark surprises at #5 with preliminary votes. 1491 Elo with the "Preliminary" votes label means Meta is shipping a frontier-grade model under a new brand, distinct from the Llama family. Once vote count converges, the confidence interval will tighten and the position may shift, but the entry signal is material.
  6. Confidence intervals tighten dramatically with votes. Claude Opus 4.5 at 56,217 votes has ±3 Elo. GPT-5.5-instant at 4,927 votes has ±8 Elo. For new model launches that ship with fewer than 10K votes, treat the absolute Elo as provisional and watch vote count alongside rank when deciding whether a leaderboard move is real.
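The vote-count behaviour in point 6 follows the usual square-root law for estimates from repeated comparisons: halving the interval takes roughly four times the votes. A rough sketch of that scaling (the constant `k` is made up to loosely match this snapshot; real Arena intervals come from bootstrapping the Bradley-Terry fit, not a closed-form formula):

```python
import math

def approx_ci(votes: int, k: float = 700.0) -> float:
    """Rough +/- Elo half-width under a CI ~ k / sqrt(votes) assumption.

    k = 700 is an illustrative constant picked so ~56K votes gives ~+/-3 Elo,
    close to the claude-opus-4-5 row; it is not an Arena parameter.
    """
    return k / math.sqrt(votes)

for votes in (5_000, 20_000, 56_000):
    print(f"{votes:>6} votes -> ~+/-{approx_ci(votes):.1f} Elo")
```

The practical takeaway is unchanged: a sub-10K-vote entry can move several Elo as votes accumulate, so compare intervals, not just point estimates.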

What This Means for AI Visibility

Arena Elo shapes which model a consumer-facing assistant defaults users to, because vendors anchor pricing tiers, marketing claims, and default-routing decisions to leaderboard position. When a brand's mention rate shifts on Claude before it shifts on GPT, that often traces back to a leaderboard rank change that prompted user migration. Brands building an AI visibility strategy should monitor the Arena leaderboard alongside their own platform-specific mention-rate tracking: in our observation, leaderboard movement leads consumer behavior shifts by roughly 30-60 days.

Methodology

Rankings, Elo scores, confidence intervals, and vote counts pulled from arena.ai/leaderboard/text/ on May 14, 2026. arena.ai is the rebranded operational home of the original LMSYS Chatbot Arena, run by lmarena-ai (formerly the LMSYS organization at UC Berkeley). Elo scores are computed via a Bradley-Terry pairwise model on roughly 6 million accumulated user votes. Confidence intervals reflect 95 percent bounds; "Preliminary" indicates that a model has not yet accumulated enough votes for a tight interval. The leaderboard refreshes weekly; treat this snapshot as accurate at time of capture and re-check arena.ai for the current state.

How Presenc AI Helps

Presenc AI tracks brand-mention rates across the major AI platforms whose underlying models are ranked above. The Arena leaderboard tells you which model wins the consumer's blind preference vote; Presenc AI tells you which brand wins inside those models' recommendations. When a model rises on the leaderboard, the brand-visibility outcomes on its hosted assistant typically follow within a quarter, which makes leaderboard-tracking a leading indicator for AI-visibility teams.

Frequently Asked Questions

Which model is #1 on the LMSYS Chatbot Arena in May 2026?

Anthropic's Claude Opus 4.6 Thinking variant leads at 1,502 Elo. The top 5 are: Claude Opus 4.6 Thinking (1502), Claude Opus 4.7 Thinking (1501), Claude Opus 4.6 (1498), Claude Opus 4.7 (1492), and Meta Muse Spark preview (1491). Anthropic holds four of the top five slots.
Where do GPT-5.5 and Gemini 3 rank?

GPT-5.5-high is at #8 (1484 Elo) and Gemini 3.1 Pro Preview is at #6 (1490). Gemini 3 Pro and Gemini 3 Flash are at #7 (1486) and #15 (1474) respectively. The OpenAI flagship is approximately 18 Elo behind the top Claude model; the Google flagship is approximately 12 Elo behind.
What is the Bradley-Terry model, and why does the Arena use it?

Bradley-Terry is a statistical model for pairwise preference data: given a set of head-to-head votes, it estimates each item's strength on a single scale (here, Elo). The Arena uses it because blind side-by-side voting maps naturally to pairwise outcomes, and Elo's Bradley-Terry interpretation gives a model-comparison number that is robust to opponent quality. A 100-Elo gap implies the higher model wins approximately 64 percent of pairwise comparisons.
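The 100-Elo-to-64-percent conversion comes straight from the base-10 logistic formula, which also puts this snapshot's much smaller gaps in perspective:

```python
def win_prob(elo_gap: float) -> float:
    """Expected win rate for the higher-rated model under the Elo/Bradley-Terry model."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{win_prob(100):.3f}")  # 0.640: the 100-Elo rule of thumb
print(f"{win_prob(18):.3f}")   # ~0.53: top Claude vs top GPT in this snapshot
print(f"{win_prob(4):.3f}")    # ~0.51: Opus 4.6 vs 4.6-thinking
```

In other words, the 18-Elo lead at the top of the board translates to winning only slightly more than half of head-to-head votes, which is why confidence intervals matter when ranks are this close.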
Are Chinese models competitive on the Arena?

Yes, but at a 30 Elo gap from the leaders. Four Chinese vendors hold top-25 slots in May 2026: Baidu ERNIE 5.1 (1473), Z.ai GLM 5.1 (1471), Xiaomi MiMo V2.5 Pro (1465), and Alibaba Qwen 3.5 Max Preview (1465). They cluster in the 1465-1473 band while the leaders sit at 1490-1502. The gap has narrowed over the past 12 months.
Does Arena Elo predict real-world performance?

Imperfectly. Arena measures blind human preference on chat-style prompts, which correlates well with general-purpose conversation quality and reasonably well with creative writing and casual coding. It correlates less well with specialised tasks (long-form coding, agentic tool use, structured extraction), which have their own benchmarks. Use Arena Elo as a top-of-funnel filter for "is this model frontier-grade," then validate with task-specific benchmarks for your workload.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.