LMSYS Chatbot Arena Elo Rankings May 2026

Live LMSYS / LM Arena Chatbot Arena Elo leaderboard for May 2026. Top 25 models from Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Alibaba, Baidu, and others with confidence intervals and vote counts.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

What the Crowd Actually Prefers in May 2026

The LMSYS Chatbot Arena (now hosted as arena.ai, formerly chat.lmsys.org) is the most-cited blind human preference benchmark in AI. Users compare two anonymous model responses side by side and pick a winner; the Bradley-Terry Elo system aggregates roughly 6 million pairwise votes accumulated since launch in May 2023. This page captures the top 25 of the Text leaderboard as of May 14, 2026, with vote counts and confidence intervals.
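The Bradley-Terry aggregation works by giving each model a strength parameter, fitting the strengths so that predicted head-to-head win rates match the observed votes, and then mapping the result onto the Elo scale. A minimal sketch of the standard minorization-maximization fit, on made-up vote counts (the model names and numbers below are illustrative, not Arena data):

```python
import math

# Illustrative pairwise win counts: wins[(a, b)] = times a beat b (NOT real Arena votes).
wins = {
    ("A", "B"): 60, ("B", "A"): 40,
    ("A", "C"): 70, ("C", "A"): 30,
    ("B", "C"): 55, ("C", "B"): 45,
}
models = ["A", "B", "C"]
strength = {m: 1.0 for m in models}  # Bradley-Terry strengths p_i

# MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
# where W_i is i's total wins and n_ij is the number of i-vs-j battles.
for _ in range(200):
    new = {}
    for i in models:
        w_i = sum(wins.get((i, j), 0) for j in models if j != i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = w_i / denom
    total = sum(new.values())
    strength = {m: v / total for m, v in new.items()}  # normalize (scale is arbitrary)

# Map to the Elo scale: 400 * log10(p_i), shifted so the mean lands near 1500.
raw = {m: 400 * math.log10(p) for m, p in strength.items()}
offset = 1500 - sum(raw.values()) / len(raw)
elo = {m: round(r + offset) for m, r in raw.items()}
print(elo)  # A ranks highest, C lowest, matching the raw win counts
```

The real leaderboard fits thousands of model pairs at once and derives confidence intervals by bootstrap resampling, but the core estimator is this same pairwise likelihood.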

Top 25 Text Leaderboard (May 14, 2026)

| Rank | Model | Org | Elo | ± | Votes |
|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | Anthropic | 1502 | 5 | 24,925 |
| 2 | claude-opus-4-7-thinking | Anthropic | 1501 | 6 | 10,413 |
| 3 | claude-opus-4-6 | Anthropic | 1498 | 4 | 26,459 |
| 4 | claude-opus-4-7 | Anthropic | 1492 | 6 | 11,006 |
| 5 | muse-spark | Meta | 1491 | 6 | Preliminary |
| 6 | gemini-3.1-pro-preview | Google | 1490 | 4 | 31,012 |
| 7 | gemini-3-pro | Google | 1486 | 4 | 41,339 |
| 8 | gpt-5.5-high | OpenAI | 1484 | 7 | 7,877 |
| 9 | grok-4.20-beta1 | xAI | 1479 | 5 | 20,258 |
| 10 | gpt-5.4-high | OpenAI | 1479 | 5 | 18,521 |
| 11 | gpt-5.2-chat-latest-20260210 | OpenAI | 1477 | 4 | 25,130 |
| 12 | grok-4.20-beta-0309-reasoning | xAI | 1477 | 5 | 18,895 |
| 13 | gpt-5.5 | OpenAI | 1476 | 7 | 7,982 |
| 14 | grok-4.20-multi-agent-beta-0309 | xAI | 1474 | 5 | 19,137 |
| 15 | gemini-3-flash | Google | 1474 | 4 | 30,753 |
| 16 | ernie-5.1 | Baidu | 1473 | 7 | 6,949 |
| 17 | claude-opus-4-5-20251101-thinking-32k | Anthropic | 1473 | 4 | 37,127 |
| 18 | gpt-5.5-instant | OpenAI | 1472 | 8 | 4,927 |
| 19 | glm-5.1 | Z.ai | 1471 | 6 | 11,485 |
| 20 | claude-opus-4-5-20251101 | Anthropic | 1468 | 3 | 56,217 |
| 21 | grok-4.1-thinking | xAI | 1467 | 3 | 56,685 |
| 22 | claude-sonnet-4-6 | Anthropic | 1467 | 5 | 18,529 |
| 23 | gpt-5.4 | OpenAI | 1467 | 5 | 19,364 |
| 24 | mimo-v2.5-pro | Xiaomi | 1465 | 7 | 7,476 |
| 25 | qwen3.5-max-preview | Alibaba | 1465 | 5 | 15,533 |

Vendor Share of the Top 25

| Vendor | Models in Top 25 | Top Rank |
|---|---|---|
| Anthropic | 7 | #1 (Opus 4.6 Thinking, 1502) |
| OpenAI | 6 | #8 (GPT-5.5-high, 1484) |
| xAI (Grok) | 4 | #9 (Grok-4.20-beta1, 1479) |
| Google | 3 | #6 (Gemini-3.1-Pro-Preview, 1490) |
| Chinese vendors (Baidu, Z.ai, Xiaomi, Alibaba) | 4 | #16 (ERNIE-5.1, 1473) |
| Meta | 1 | #5 (Muse Spark, 1491 preliminary) |

Six Things the Rankings Tell You

  1. Anthropic holds 4 of the top 5 slots. Claude Opus 4.6 and Opus 4.7 (both regular and thinking variants) dominate the top of the leaderboard, with only Meta's preliminary Muse Spark breaking up the cluster. As of this snapshot, Anthropic holds its strongest human-preference lead since Claude 3 Opus briefly topped the board in 2024.
  2. Thinking-mode adds roughly 4-9 Elo. Pairing Claude Opus 4.6 against 4.6-thinking (1498 vs 1502), Claude Opus 4.7 against 4.7-thinking (1492 vs 1501), and GPT-5.5 against GPT-5.5-high (1476 vs 1484) shows the pattern: gaps of 4, 9, and 8 Elo. The premium is small in absolute terms but consistent across vendors that publish both variants.
  3. OpenAI has the most models in the top 25 but the lowest top rank among the big three. Six OpenAI entries (GPT-5.5-high, 5.5, 5.5-instant, 5.4-high, 5.4, 5.2-chat-latest) span ranks 8-23, but no GPT model breaks into the top 5. The Anthropic-OpenAI gap is roughly 18 Elo at the top of each vendor's stack (1502 vs 1484).
  4. Chinese vendors hold four top-25 slots and are gaining ground. Baidu ERNIE 5.1, Z.ai GLM 5.1, Xiaomi MiMo V2.5 Pro, and Alibaba Qwen 3.5 Max Preview occupy ranks 16, 19, 24, and 25. All four are clustered in the 1465-1473 range with overlapping confidence intervals. The frontier-vs-Chinese gap is roughly 30 Elo points (1502 vs 1473).
  5. Meta's Muse Spark surprises at #5 with preliminary votes. 1491 Elo with the "Preliminary" votes label means Meta is shipping a frontier-grade model under a new brand, distinct from the Llama family. Once vote count converges, the confidence interval will tighten and the position may shift, but the entry signal is material.
  6. Confidence intervals tighten dramatically with votes. Claude Opus 4.5 at 56,217 votes has ±3 Elo. GPT-5.5-instant at 4,927 votes has ±8 Elo. For new model launches that ship with fewer than 10K votes, treat the absolute Elo as provisional and watch vote count alongside rank when deciding whether a leaderboard move is real.
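The vote-count behaviour in point 6 follows the usual square-root law for estimates from repeated comparisons: halving the interval takes roughly four times the votes. A rough sketch of that scaling (the constant `k` is made up to loosely match this snapshot; real Arena intervals come from bootstrapping the Bradley-Terry fit, not a closed-form formula):

```python
import math

def approx_ci(votes: int, k: float = 700.0) -> float:
    """Rough +/- Elo half-width under a CI ~ k / sqrt(votes) assumption.

    k = 700 is an illustrative constant picked so ~56K votes gives ~+/-3 Elo,
    close to the claude-opus-4-5 row; it is not an Arena parameter.
    """
    return k / math.sqrt(votes)

for votes in (5_000, 20_000, 56_000):
    print(f"{votes:>6} votes -> ~+/-{approx_ci(votes):.1f} Elo")
```

The practical takeaway is unchanged: a sub-10K-vote entry can move several Elo as votes accumulate, so compare intervals, not just point estimates.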

What This Means for AI Visibility

Arena Elo shapes which model a consumer-facing assistant defaults users to, because vendors anchor pricing tiers, marketing claims, and default-routing decisions to leaderboard position. When a brand's mention rate shifts on Claude before it shifts on GPT, that often traces back to a leaderboard rank change that prompted user migration. Brands building an AI visibility strategy should monitor the Arena leaderboard alongside their own platform-specific mention-rate tracking: in our observation, leaderboard movement leads consumer behavior shifts by roughly 30-60 days.

Methodology

Rankings, Elo scores, confidence intervals, and vote counts pulled from arena.ai/leaderboard/text/ on May 14, 2026. arena.ai is the rebranded operational home of the original LMSYS Chatbot Arena, run by lmarena-ai (formerly the LMSYS organization at UC Berkeley). Elo scores are computed via a Bradley-Terry pairwise model on roughly 6 million accumulated user votes. Confidence intervals reflect 95 percent bounds; "Preliminary" indicates that a model has not yet accumulated enough votes for a tight interval. The leaderboard refreshes weekly; treat this snapshot as accurate at time of capture and re-check arena.ai for the current state.

How Presenc AI Helps

Presenc AI tracks brand-mention rates across the major AI platforms whose underlying models are ranked above. The Arena leaderboard tells you which model wins the consumer's blind preference vote; Presenc AI tells you which brand wins inside those models' recommendations. When a model rises on the leaderboard, the brand-visibility outcomes on its hosted assistant typically follow within a quarter, which makes leaderboard-tracking a leading indicator for AI-visibility teams.

Frequently Asked Questions

Which model is #1 on the LMSYS Chatbot Arena in May 2026?

Anthropic's Claude Opus 4.6 Thinking variant leads at 1,502 Elo. The top 5 are: Claude Opus 4.6 Thinking (1502), Claude Opus 4.7 Thinking (1501), Claude Opus 4.6 (1498), Claude Opus 4.7 (1492), and Meta Muse Spark preview (1491). Anthropic holds four of the top five slots.
Where do GPT-5.5 and Gemini 3 rank?

GPT-5.5-high is at #8 (1484 Elo) and Gemini 3.1 Pro Preview is at #6 (1490). Gemini 3 Pro and Gemini 3 Flash are at #7 (1486) and #15 (1474) respectively. The OpenAI flagship is approximately 18 Elo behind the top Claude model; the Google flagship is approximately 12 Elo behind.
What is the Bradley-Terry model, and why does the Arena use it?

Bradley-Terry is a statistical model for pairwise preference data: given a set of head-to-head votes, it estimates each item's strength on a single scale (here, Elo). The Arena uses it because blind side-by-side voting maps naturally to pairwise outcomes, and Elo's Bradley-Terry interpretation gives a model-comparison number that is robust to opponent quality. A 100-Elo gap implies the higher model wins approximately 64 percent of pairwise comparisons.
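The 100-Elo-to-64-percent conversion comes straight from the base-10 logistic formula, which also puts this snapshot's much smaller gaps in perspective:

```python
def win_prob(elo_gap: float) -> float:
    """Expected win rate for the higher-rated model under the Elo/Bradley-Terry model."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{win_prob(100):.3f}")  # 0.640: the 100-Elo rule of thumb
print(f"{win_prob(18):.3f}")   # ~0.53: top Claude vs top GPT in this snapshot
print(f"{win_prob(4):.3f}")    # ~0.51: Opus 4.6 vs 4.6-thinking
```

In other words, the 18-Elo lead at the top of the board translates to winning only slightly more than half of head-to-head votes, which is why confidence intervals matter when ranks are this close.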
Are Chinese models competitive on the Arena?

Yes, but at a 30 Elo gap from the leaders. Four Chinese vendors hold top-25 slots in May 2026: Baidu ERNIE 5.1 (1473), Z.ai GLM 5.1 (1471), Xiaomi MiMo V2.5 Pro (1465), and Alibaba Qwen 3.5 Max Preview (1465). They cluster in the 1465-1473 band while the leaders sit at 1490-1502. The gap has narrowed over the past 12 months.
Does Arena Elo predict real-world performance?

Imperfectly. Arena measures blind human preference on chat-style prompts, which correlates well with general-purpose conversation quality and reasonably well with creative writing and casual coding. It correlates less well with specialised tasks (long-form coding, agentic tool use, structured extraction), which have their own benchmarks. Use Arena Elo as a top-of-funnel filter for "is this model frontier-grade," then validate with task-specific benchmarks for your workload.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.