Research

AI Agent Evaluation and RLHF Startups, May 2026

The funded startups building evaluation, RLHF, and data-labelling infrastructure for AI agents in 2026. Scale AI, Surge AI, Toloka, Argilla, Patronus AI, Confident AI, and the agent-eval supply chain.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

The Evaluation Supply Chain for AI Agents in 2026

Every AI agent in production depends on an evaluation supply chain: human-labelled examples for fine-tuning, automated eval frameworks for regression testing, and RLHF (reinforcement learning from human feedback) infrastructure for ongoing model improvement. The companies that supply this chain are some of the most heavily funded in the AI sector, and the category has consolidated rapidly through 2025-2026. This page consolidates the major eval and RLHF startups, their funding, and their market positioning.
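The regression-testing leg of that supply chain can be illustrated with a minimal sketch: a fixed "golden set" of human-labelled examples is replayed against an agent after every model or prompt change, and a pass-rate threshold acts as the release gate. Everything here (run_agent, GOLDEN_SET, the 0.9 threshold) is an illustrative assumption, not any vendor's API.

```python
# Minimal eval-regression sketch: replay labelled examples against an
# agent and gate releases on the pass rate. All names are illustrative.

GOLDEN_SET = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def run_agent(prompt: str) -> str:
    # Stand-in for a real agent call (e.g. an LLM API request).
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned[prompt]

def regression_pass_rate(golden_set: list[dict]) -> float:
    """Fraction of golden examples whose output contains the expected answer."""
    passed = sum(
        1 for case in golden_set
        if case["expected"].lower() in run_agent(case["prompt"]).lower()
    )
    return passed / len(golden_set)

if __name__ == "__main__":
    rate = regression_pass_rate(GOLDEN_SET)
    print(f"pass rate: {rate:.2f}")
    # Fail the CI run if quality regressed below the chosen threshold.
    assert rate >= 0.9, "regression: eval pass rate dropped below threshold"
```

Commercial frameworks add managed datasets, LLM-as-judge scoring, and dashboards on top of this loop, but the core contract (labelled examples in, pass rate out) is the same.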

Data Labelling and RLHF Service Providers

| Company | Funding | Valuation / Notes |
| --- | --- | --- |
| Scale AI | ~$1.6B cumulative (Meta acquired 49% for $14.3B, June 2025) | ~$29B Meta-implied valuation; primary labelling provider for OpenAI, Anthropic, Meta |
| Surge AI | Bootstrapped; reported revenue $1B+ in 2024 | Anthropic primary labelling partner; bootstrapped against Scale's VC-funded model |
| Toloka | Owned by Yandex; growing AI-eval business | Major non-Scale alternative for European and global labelling |
| Mercor | $100M+ cumulative | AI-marketplace-for-experts model; rapid growth in 2025-2026 |
| Snorkel AI | ~$135M cumulative | Programmatic labelling; enterprise focus |

Evaluation Framework and Eval-as-a-Service Startups

| Company | Funding | Focus |
| --- | --- | --- |
| Patronus AI | ~$50M cumulative | Hallucination + factuality evaluation; FT Series B |
| Galileo | ~$45M cumulative | RAG + agent evaluation focus |
| Confident AI (DeepEval) | ~$5M cumulative | Open-source DeepEval framework + cloud |
| Braintrust | $120M cumulative ($800M valuation) | Eval framework + observability integrated (covered separately) |
| Argilla | Acquired by Hugging Face Q4 2024 | Open-source labelling and eval; HF-bundled |
| LightOn | ~$25M cumulative | European eval and fine-tuning platform |

Six Things the Eval-Startup Landscape Tells You

  1. Meta's $14.3B Scale AI acquisition reset the category. Meta took 49 percent of Scale AI in June 2025 at an implied $29 billion valuation, the largest single AI-data-services deal in history. The deal locked up Scale's capacity for Meta and forced competing labs (OpenAI, Anthropic, Google) to diversify labelling providers, expanding the addressable market for Surge, Mercor, and Toloka.
  2. Surge AI's bootstrapped path is the category's standout counterexample. Reported $1B+ revenue in 2024 without venture capital validates the "pure services, not platform" approach for labelling. Anthropic is the primary customer. Whether Surge stays bootstrapped as it scales is the structural question for 2026-2027.
  3. Patronus AI is the funded eval-specialist leader. $50M cumulative funding focused on hallucination and factuality evaluation. The thesis: as agents are deployed in regulated and high-stakes settings, hallucination evaluation becomes a structural requirement, not an optional quality metric.
  4. Mercor scaled fastest in 2025-2026. The "AI marketplace for experts" positioning (matching domain experts to labelling and RLHF tasks) is differentiated from the Scale / Surge mass-labelling approach. Mercor's growth reflects the shift toward higher-quality, lower-volume labelling for frontier model fine-tunes.
  5. Hugging Face's Argilla acquisition opens up eval-bundling. With Argilla acquired in Q4 2024, HF can offer integrated labelling and eval inside its model-hosting platform. Competitors (Snorkel, LightOn) now must differentiate against an HF-bundled option.
  6. Open-source eval frameworks have caught on. DeepEval (Confident AI's framework), Argilla, and LangSmith's eval functionality are all open or freely available. Cloud commercial tiers monetise on top of open-source foundations. The pattern echoes observability: free open-source + paid cloud.
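The hallucination-measurement thesis in point 3 reduces to a simple metric over human-reviewed outputs: the fraction of agent claims a reviewer labelled unsupported. The sketch below assumes a hypothetical record schema (claim / label fields); real eval platforms track far richer metadata, but the compliance-facing number is this ratio.

```python
# Sketch of a hallucination-rate metric over human-reviewed agent outputs.
# Field names ("claim", "label", "unsupported") are illustrative assumptions.

def hallucination_rate(records: list[dict]) -> float:
    """Fraction of claims a human reviewer labelled 'unsupported'."""
    if not records:
        return 0.0
    unsupported = sum(1 for r in records if r["label"] == "unsupported")
    return unsupported / len(records)

reviewed = [
    {"claim": "Revenue grew 12% in Q3.", "label": "supported"},
    {"claim": "The filing cites 14 patents.", "label": "unsupported"},
    {"claim": "Headquarters are in Austin.", "label": "supported"},
    {"claim": "The CEO joined in 2019.", "label": "supported"},
]

print(f"hallucination rate: {hallucination_rate(reviewed):.2%}")
```

Tracked per release, a number like this is what turns "factuality" from a marketing claim into the kind of auditable threshold regulated deployments require.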

What This Means for AI Visibility

Eval and labelling startups themselves rarely appear in consumer AI visibility tracking, but they are critical to two B2B segments: AI labs (OpenAI, Anthropic, Meta, Google) and AI-native companies running their own fine-tunes. Brands selling into either segment (security, billing, dev tooling, cloud) should track visibility within these companies' buyer profiles. The eval-supply-chain layer is small in pure-revenue terms but has outsized influence on which models exist and which capabilities they have, which in turn shapes downstream brand-visibility outcomes for everyone.

Methodology

Funding and acquisition data collected May 15, 2026 from Crunchbase, PitchBook, Reuters and Financial Times coverage of the Scale-Meta deal, and vendor websites. Revenue figures where reported are vendor self-disclosures and should be treated as directional. Refreshed quarterly.

How Presenc AI Helps

Presenc AI tracks brand visibility inside AI labs and AI-native companies' buyer demographics. For brands selling into the eval-supply-chain segment specifically, the buyer universe is concentrated (~10-50 companies globally) and visibility inside that universe is the operational signal that connects pipeline investment to deal flow.

Frequently Asked Questions

What did Meta's Scale AI deal change?

Meta took 49 percent of Scale AI in June 2025 at an implied $29 billion valuation, the largest single AI-data-services deal in history. Practical effect: Meta gets priority access to Scale's labelling and RLHF capacity for its own model fine-tunes (Llama family and Muse Spark). Competing AI labs (OpenAI, Anthropic, Google) have diversified their labelling providers in response, expanding the addressable market for Surge AI, Mercor, and Toloka.

Which labelling provider has the highest reported revenue?

Surge AI, by public reporting. Surge is bootstrapped (no venture capital) and reportedly crossed $1B in revenue in 2024. Anthropic's preference for Surge reflects the alignment-focused labelling quality that Surge emphasises, distinct from the higher-volume mass-labelling approach Scale was historically known for.

Why is Patronus AI considered the eval-specialist leader?

$50M cumulative funding focused specifically on hallucination and factuality evaluation. The market thesis: as AI agents deploy into regulated industries (finance, healthcare, legal), measurable hallucination rates become structural compliance requirements rather than optional quality metrics. Patronus's eval framework + benchmark library is the funded category leader in that specific cut.

Are open-source eval frameworks good enough for production?

Yes, increasingly. DeepEval (Confident AI's framework), Argilla (now bundled with Hugging Face), and LangSmith's eval functionality cover most production evaluation needs at no cost. Commercial cloud tiers add scale, multi-tenant management, and integration polish. The pattern mirrors observability: open-source foundations with commercial cloud monetisation on top.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.