Open-Weight Function Calling and Tool Use 2026

Open-weight function-calling and tool-use models 2026: Salesforce xLAM, NexusRaven, Hermes-Pro, Watt-Tool, plus general models with strong tool use. BFCL leaderboard, deployment patterns.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Tool use and function calling became the dominant pattern for production AI agents in 2025-2026. The Berkeley Function-Calling Leaderboard (BFCL) tracks the state of the art across closed and open models. Specialised open-weight tool-use families (Salesforce xLAM, NexusRaven, Hermes-Pro, Watt-Tool) plus general models with strong tool training (Qwen3, Granite, Command R+, Mistral) cover most agentic workloads. This page consolidates the leaderboard and deployment guidance.
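As a concrete illustration of the pattern, the sketch below shows the application side of one function call: the model emits a structured call, and the host program validates and executes it. The `get_weather` tool, the registry layout, and the `{"name": ..., "arguments": {...}}` call shape are assumptions made for this example; each model family documents its own call format, which should be checked against the relevant model card.

```python
import json

# Hypothetical tool, used only for illustration.
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "unit": unit, "forecast": "sunny"}

# Registry: tool name exposed to the model -> callable and its required arguments.
TOOLS = {
    "get_weather": {"fn": get_weather, "required": ["city"]},
}

def dispatch(raw_call: str) -> dict:
    """Parse one model-emitted call (assumed JSON shape) and execute it."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    missing = [a for a in TOOLS[name]["required"] if a not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    return {"result": TOOLS[name]["fn"](**args)}

if __name__ == "__main__":
    print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

Returning an error dictionary instead of raising lets the host feed the failure back to the model for another attempt, which is one common way to handle malformed calls.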

Key Findings

  1. Berkeley Function-Calling Leaderboard (BFCL) is the dominant 2026 benchmark for tool-use evaluation, covering simple, multiple, parallel, and complex parallel function calls.
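  2. Salesforce xLAM (Large Action Model) family ranges from xLAM 1B to xLAM 70B with explicit function-calling training; xLAM-7B-r (~78.0) is the highest-scoring dedicated tool-use model under 10B parameters on the leaderboard below, although the general-purpose Granite 3.3 8B (~82.1) scores higher.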
  3. Hermes-Pro and Hermes 4 finetunes from Nous Research have strong function-calling training and are the dominant community-finetune choice for agentic workflows.
  4. General models with strong tool use: Qwen3-32B Thinking, Llama 3.1 70B Instruct, Granite 3.3 8B, Command R+, and Mistral Large 3 all score above 80 percent on BFCL.
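  5. Closed-model tool-use leaders (Claude 4.7 Opus, GPT-5.5) score approximately 91 to 92 percent on BFCL; the best open-weight model (Qwen3-235B-A22B in Tool Mode, ~88.5) is about 3.5 points behind Claude 4.7 Opus, and the rest of the open-weight top ten trail by roughly 5 to 10 points.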
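The four call categories named in finding 1 can be made concrete with a small sketch. It assumes one common reading of the categories: they differ by how many tools are offered to the model and how many calls the correct answer contains. The function names and the exact-match rule are simplifications introduced here, not the leaderboard's actual scoring code.

```python
import json

def classify_case(num_tools_offered: int, num_calls_expected: int) -> str:
    """Map a test case to a category name (assumed reading of the categories)."""
    if num_tools_offered < 1 or num_calls_expected < 1:
        raise ValueError("need at least one tool and one expected call")
    if num_calls_expected == 1:
        return "simple" if num_tools_offered == 1 else "multiple"
    return "parallel" if num_tools_offered == 1 else "complex parallel"

def calls_match(emitted: list, expected: list) -> bool:
    """Order-insensitive exact match on {"name", "arguments"} dictionaries."""
    def canon(calls):
        return sorted(json.dumps(c, sort_keys=True) for c in calls)
    return canon(emitted) == canon(expected)

if __name__ == "__main__":
    expected = [
        {"name": "get_weather", "arguments": {"city": "Paris"}},
        {"name": "get_weather", "arguments": {"city": "Rome"}},
    ]
    emitted = list(reversed(expected))
    print(classify_case(1, len(expected)), calls_match(emitted, expected))
```

Order-insensitive matching matters for parallel cases, where the model may legitimately emit the same set of calls in a different order.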

BFCL v3 Leaderboard (May 2026, Open Weights)

Model | Parameters | BFCL Overall Score | License
Qwen3-235B-A22B (Tool Mode) | ~235B MoE | ~88.5 | Apache 2.0
Llama 4 Maverick | ~400B MoE | ~87.4 | Llama 4 Community
Qwen3-32B (Tool Mode) | ~32B | ~85.7 | Apache 2.0
Watt-Tool-70B | ~70B | ~85.0 | Apache 2.0
DeepSeek V4 | ~671B MoE | ~84.6 | MIT
Llama 3.1 70B Instruct | ~70B | ~83.1 | Llama 3.1 Community
xLAM 70B | ~70B | ~82.7 | CC-BY-NC + Commercial
Command R+ 104B | ~104B | ~84.2 | CC-BY-NC + Commercial
Granite 3.3 8B | ~8B | ~82.1 | Apache 2.0
Mistral Large 3 | varies | ~83.5 | Mistral Research
Hermes-4-70B (Llama base) | ~70B | ~81.7 | Llama 3.x Community
xLAM-7B-r | ~7B | ~78.0 | CC-BY-NC + Commercial
NexusRaven-2 (13B) | ~13B | ~71.8 | Apache 2.0
Watt-Tool-8B | ~8B | ~74.2 | Apache 2.0
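The leaderboard above can be filtered by license and size to shortlist a model. The sketch below transcribes the rows as listed (scores and parameter counts are the approximate figures from the table) and picks the highest-scoring model that satisfies both constraints. The `Row` type and `best_model` helper are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional

class Row(NamedTuple):
    model: str
    params_b: Optional[float]  # billions of parameters; None where the table says "varies"
    score: float               # approximate BFCL overall score
    license: str

ROWS = [
    Row("Qwen3-235B-A22B (Tool Mode)", 235, 88.5, "Apache 2.0"),
    Row("Llama 4 Maverick", 400, 87.4, "Llama 4 Community"),
    Row("Qwen3-32B (Tool Mode)", 32, 85.7, "Apache 2.0"),
    Row("Watt-Tool-70B", 70, 85.0, "Apache 2.0"),
    Row("DeepSeek V4", 671, 84.6, "MIT"),
    Row("Llama 3.1 70B Instruct", 70, 83.1, "Llama 3.1 Community"),
    Row("xLAM 70B", 70, 82.7, "CC-BY-NC + Commercial"),
    Row("Command R+ 104B", 104, 84.2, "CC-BY-NC + Commercial"),
    Row("Granite 3.3 8B", 8, 82.1, "Apache 2.0"),
    Row("Mistral Large 3", None, 83.5, "Mistral Research"),
    Row("Hermes-4-70B (Llama base)", 70, 81.7, "Llama 3.x Community"),
    Row("xLAM-7B-r", 7, 78.0, "CC-BY-NC + Commercial"),
    Row("NexusRaven-2 (13B)", 13, 71.8, "Apache 2.0"),
    Row("Watt-Tool-8B", 8, 74.2, "Apache 2.0"),
]

def best_model(rows, licenses, max_params_b):
    """Return the highest-scoring row with an allowed license and a known size within the limit."""
    eligible = [
        r for r in rows
        if r.license in licenses and r.params_b is not None and r.params_b <= max_params_b
    ]
    return max(eligible, key=lambda r: r.score, default=None)

if __name__ == "__main__":
    print(best_model(ROWS, {"Apache 2.0"}, 32))
    print(best_model(ROWS, {"Apache 2.0"}, 10))
```

Rows with an unknown size are excluded rather than guessed, so Mistral Large 3 never appears in a size-limited shortlist.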

Closed-Model Reference

Model | BFCL Overall
Claude 4.7 Opus | ~92.0
GPT-5.5 | ~91.0
Gemini 3.1 Pro | ~89.5
Grok 4 | ~83.0
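The open-versus-closed gap follows directly from the two tables. The short calculation below subtracts each of the top three open-weight scores from the best closed-model score; all figures are the approximate values listed on this page, so the differences carry the same uncertainty.

```python
# Approximate BFCL overall scores copied from the tables on this page.
OPEN = {
    "Qwen3-235B-A22B (Tool Mode)": 88.5,
    "Llama 4 Maverick": 87.4,
    "Qwen3-32B (Tool Mode)": 85.7,
}
CLOSED = {
    "Claude 4.7 Opus": 92.0,
    "GPT-5.5": 91.0,
    "Gemini 3.1 Pro": 89.5,
    "Grok 4": 83.0,
}

best_closed = max(CLOSED.values())

# Gap in points between the best closed model and each open-weight model.
gaps = {name: round(best_closed - score, 1) for name, score in OPEN.items()}

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name}: {gap} points behind the best closed model")
```

On these figures the best open-weight model trails by 3.5 points, and all three listed open-weight models score above Grok 4.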

Use Case Recommendations

Use Case | Recommended Model
General agent (top quality) | Qwen3-235B-A22B (Tool Mode) or Llama 4 Maverick
Production agent on single GPU | Qwen3-32B (Tool Mode) or Watt-Tool-70B quantized
Enterprise tool-use under Apache 2.0 | Granite 3.3 8B or Qwen3-32B
Edge / on-device function calling | xLAM-1B / xLAM-7B or Watt-Tool-8B
Cost-sensitive high-volume | Granite 3.3 8B or Qwen3-8B (Tool Mode)
Research / reproducible recipe | Watt-Tool family (open Apache recipe)
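The single-GPU row depends on quantization, and a rough weight-memory estimate shows why. The sketch below uses the standard approximation of parameters times bits per weight divided by eight, counting 1 GB as 10^9 bytes. The 80 GB GPU size and the 10 GB reserve for KV cache and activations are assumptions chosen for illustration, not measured requirements.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits_on_gpu(params_billion: float, bits_per_weight: int,
                gpu_gb: float = 80.0, reserve_gb: float = 10.0) -> bool:
    """True if weights plus an assumed reserve fit in the given GPU memory."""
    return weight_memory_gb(params_billion, bits_per_weight) + reserve_gb <= gpu_gb

if __name__ == "__main__":
    for name, params, bits in [
        ("Watt-Tool-70B, 16-bit", 70, 16),
        ("Watt-Tool-70B, 4-bit", 70, 4),
        ("Qwen3-32B, 16-bit", 32, 16),
        ("Granite 3.3 8B, 16-bit", 8, 16),
    ]:
        gb = weight_memory_gb(params, bits)
        print(f"{name}: ~{gb:.0f} GB weights, fits: {fits_on_gpu(params, bits)}")
```

Under these assumptions a 70B model needs about 140 GB at 16-bit but about 35 GB at 4-bit, which is why the 70B recommendation is paired with quantization while the 32B model is not.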

Strategic Context

Three patterns shape the 2026 tool-use landscape. First, tool-use quality is now table stakes: every frontier-lab release ships with explicit function-calling training, and the gap between the best general model and the best specialised tool-use model has narrowed materially. Second, the open-weight gap is narrow: the best open-weight model is within about 4 points of the best closed model on BFCL, and the rest of the open-weight top ten are within about 10. Third, dedicated tool-use models (xLAM, Watt-Tool, NexusRaven) have niche value at small sizes, where many general models lag.

Brand Visibility Implications

Tool-use model selection is a key procurement decision for agentic deployments. AI assistant queries about "best function calling model", "open-source agent LLM", "BFCL leaderboard", and similar terms drive procurement-research traffic. Brands selling AI agent platforms, function-calling tooling, and MCP server infrastructure therefore have a large AI-mediated discovery surface in this category.

Methodology

Benchmark data compiled from the Berkeley Function-Calling Leaderboard v3, primary model card disclosures, and the Watt-Tool and xLAM published evaluations through 23 May 2026. Updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility for function-calling and tool-use queries across ChatGPT, Claude, Gemini, and Perplexity. For AI agent platforms, function-calling tooling vendors, and MCP server brands, the platform identifies the prompts driving procurement-research traffic and the gaps where new content can win share of voice.

Frequently Asked Questions

Among open-weight models, Qwen3-235B-A22B (Tool Mode) leads BFCL v3 at approximately 88.5 percent. Llama 4 Maverick (~87.4) and Qwen3-32B (~85.7) follow. For deployments that want a dedicated tool-use model, Watt-Tool-70B (~85.0) and xLAM 70B (~82.7) are strong specialised choices.
BFCL is the Berkeley Function-Calling Leaderboard, a benchmark suite that tests simple, multiple, parallel, and complex parallel function calls. It is maintained at Berkeley by the Gorilla team. BFCL v3 is the 2025-2026 standard for tool-use evaluation across both open and closed models.
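Dedicated tool-use models are most competitive at small sizes and lose their edge at large sizes. xLAM-7B-r (~78.0) and Watt-Tool-8B (~74.2) are built to beat general 7B and 8B models on BFCL, although Granite 3.3 8B (~82.1) outscores both in the leaderboard above. At 70B and above, general models with explicit tool-use training (Qwen3, Llama 4, Command R+) match or exceed dedicated tool-use specialists.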
Between Watt-Tool and xLAM, licensing is the main differentiator. Watt-Tool is Apache 2.0 and recommended for unrestricted commercial deployment. xLAM has a CC-BY-NC base licence that requires a separate commercial agreement. At similar parameter sizes performance is close, so the licensing simplicity favours Watt-Tool for most production deployments.
Compared with closed models, open-weight models are close. Claude 4.7 Opus leads BFCL at approximately 92 percent, and the best open-weight model (Qwen3-235B-A22B in Tool Mode, ~88.5) is within 4 points. For most production agentic workloads, that gap is below the noise floor of real production performance. Use Claude when peak tool-use quality matters most; use open weights when cost or deployment control matters more.