What this is
Open-source computer-use agents reached rough parity with the proprietary frontier in 2026. OpenCUA-72B, from the xlang-ai group, hit 45.0% on OSWorld-Verified, comparable to Claude 4 Sonnet, and set a state-of-the-art 37.4% on UI-Vision. This page is a 2026-05-15 head-to-head snapshot.
Benchmark Comparison
| Model | OSWorld-Verified | ScreenSpot-Pro | UI-Vision |
|---|---|---|---|
| OpenCUA-72B | 45.0% (SOTA open) | 60.8% | 37.4% (SOTA) |
| OpenCUA-32B | Above GPT-4o-based CUA (exact score not published) | ~55% | ~32% |
| Claude 4 Sonnet (computer use) | ~46-48% | ~62% | n/a |
| OpenAI CUA (GPT-4o-based) | ~38% | ~52% | ~28% |
| OpenAI GPT-5.5 CUA | ~49% | ~64% | ~36% |
| Gemini 2.5 (computer use) | ~40% | ~57% | ~31% |
Architectural Differences
| Dimension | OpenCUA | Claude Computer Use |
|---|---|---|
| License | Open weights + open framework | Proprietary API |
| Training corpus | AgentNet (22,600+ task demonstrations) | Undisclosed |
| Reasoning approach | Chain-of-thought "inner monologue" | Implicit reasoning + tool calls |
| OS coverage | Windows, macOS, Ubuntu | Sandbox VM + bring-your-own host |
| App breadth | 200+ applications + websites | Anything visible to display |
| Default deployment | Self-host (HuggingFace, local) | Anthropic API |
| Tool action set | computer + browser + system actions | computer_use_20251124 (with zoom) |
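The "tool action set" row can be made concrete with a small dispatcher sketch. This is illustrative only: the action names and handler signatures below are assumptions, not OpenCUA's or Anthropic's actual tool schema; the real stacks define their own action vocabularies (e.g. Anthropic's versioned `computer_use_20251124` tool).

```python
# Hypothetical sketch: an agent emits typed action dicts and a thin
# executor routes them to computer- or browser-level handlers.
# Action names here are illustrative, not either vendor's real schema.
def execute(action: dict) -> str:
    handlers = {
        "click":    lambda a: f"click at ({a['x']}, {a['y']})",  # computer action
        "type":     lambda a: f"type {a['text']!r}",             # computer action
        "key":      lambda a: f"press {a['key']}",               # system action
        "open_url": lambda a: f"navigate to {a['url']}",         # browser action
    }
    kind = action["type"]
    if kind not in handlers:
        raise ValueError(f"unsupported action: {kind}")
    return handlers[kind](action)

print(execute({"type": "click", "x": 640, "y": 360}))
print(execute({"type": "open_url", "url": "https://example.com"}))
```

The practical difference the table points at: OpenCUA's open framework lets you extend this dispatch table yourself, while the proprietary tools fix the action vocabulary per API version.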
AgentNet Dataset (OpenCUA's training corpus)
| Attribute | Value |
|---|---|
| Task demonstrations | 22,600+ |
| Operating systems | Windows, macOS, Ubuntu |
| Applications + websites covered | 200+ |
| Innovation | Trajectories augmented with chain-of-thought |
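To make the chain-of-thought augmentation concrete, here is a minimal sketch of what a CoT-annotated trajectory step could look like. The field names and structure are assumptions for illustration, not AgentNet's published schema: each step pairs an observation and grounded action with a generated "inner monologue" explaining the step.

```python
# Illustrative only: field names and structure are hypothetical,
# not AgentNet's actual data format.
from dataclasses import dataclass, field

@dataclass
class CoTStep:
    """One step of a computer-use trajectory augmented with reasoning."""
    screenshot: str                              # ID of the screen capture
    inner_monologue: str                         # chain-of-thought for this step
    action: dict = field(default_factory=dict)   # grounded GUI action

step = CoTStep(
    screenshot="obs_0041.png",
    inner_monologue=(
        "The Save dialog is open and the filename field is focused, "
        "so I should type the name before clicking Save."
    ),
    action={"type": "type", "text": "report_q3.xlsx"},
)

# A trajectory is an ordered list of such steps.
trajectory = [step]
print(len(trajectory), step.action["type"])
```

Training on the monologue alongside the action is the published recipe the "Six Things" section below credits for OpenCUA's frontier-level results.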
Six Things the Data Tells You
- Open weights matched Claude 4 Sonnet on computer use. The OpenCUA-72B / Claude 4 Sonnet gap on OSWorld-Verified is within noise.
- Open weights still trail GPT-5.5 on computer use by ~4 points on OSWorld-Verified.
- The CoT "inner monologue" trick is the OpenCUA innovation. Open frameworks now have a published recipe for hitting frontier-level computer-use performance.
- AgentNet (22.6K demonstrations) is the open data asset to beat. Comparable proprietary datasets are not publicly disclosed.
- Enterprise on-prem can ship now. OpenCUA-72B running on a single multi-GPU host is competitive with Claude 4 Sonnet for many computer-use tasks.
- The proprietary frontier still wins on UX and safety. Anthropic and OpenAI ship safer defaults, better tool-use rate-limiting, and clearer terms.
What This Means for AI Visibility
If open-source computer-use agents reach Claude 4 Sonnet parity, they will be the surface enterprises deploy for on-prem AI workflows, including agentic commerce. Brands that want to be reachable by these agents need to test agent reachability against both proprietary stacks (Claude, ChatGPT, Gemini) and open ones (OpenCUA, browser-use), because the actual install bases diverge.
Methodology
Benchmark figures sourced from VentureBeat's OpenCUA coverage, the OpenCUA project page, the OpenCUA arXiv paper, and the OpenCUA GitHub repository. Claude / OpenAI / Gemini comparison figures cross-checked against Coasty's computer-use agent comparison.
How Presenc AI Helps
Presenc AI runs agent-reachability tests against both proprietary computer-use agents (Claude, GPT-5.5, Gemini) and open-source stacks (OpenCUA, browser-use). Brands that need consistent agent reachability across the whole install base, not just the surface their own team uses internally, get a true picture of how they appear inside agentic workflows.