Tool use and function calling became the dominant pattern for production AI agents in 2025-2026. The Berkeley Function-Calling Leaderboard (BFCL) tracks the state of the art across closed and open models. Specialised open-weight tool-use families (Salesforce xLAM, NexusRaven, Hermes-Pro, Watt-Tool) plus general models with strong tool training (Qwen3, Granite, Command R+, Mistral) cover most agentic workloads. This page consolidates the leaderboard and deployment guidance.
Key Findings
- The Berkeley Function-Calling Leaderboard (BFCL) is the dominant 2026 benchmark for tool-use evaluation, covering simple, multiple, parallel, and parallel multiple function calls, plus multi-turn scenarios added in v3.
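- Salesforce's xLAM (Large Action Model) family ranges from xLAM 1B to xLAM 70B with explicit function-calling training; xLAM-7B-r is the highest-scoring specialised tool-use model under 10B in the table below (~78.0), although the general-purpose Granite 3.3 8B scores higher (~82.1).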
- Hermes-Pro and Hermes 4 finetunes from Nous Research have strong function-calling training and are the dominant community-finetune choice for agentic workflows.
- General models with strong tool use (Qwen3-32B Thinking, Llama 3.1 70B Instruct, Granite 3.3 8B, Command R+, Mistral Large 3) all score above 80 percent on BFCL.
- Closed-model tool-use leaders (Claude 4.7 Opus, GPT-5.5) score approximately 91 to 92 percent on BFCL; the leading open-weight models are within 5 to 10 points of them.
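To make the single-turn categories concrete, here is a minimal Python sketch that labels a test case by how many tools are offered and how many calls are expected. `ToolCall`, `classify_case`, and `calls_match` are hypothetical names introduced for illustration, and the mapping is a simplified reading of the category definitions (the combined many-tools, many-calls case is what BFCL calls "parallel multiple"). The comparison is a plain order-insensitive equality check and is an assumption of this sketch, not BFCL's own scorer.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One function call emitted by a model (hypothetical structure)."""
    name: str
    arguments: dict


def classify_case(num_tools_offered: int, expected_calls: list) -> str:
    """Label a single-turn test case by tools offered and calls expected."""
    many_tools = num_tools_offered > 1
    many_calls = len(expected_calls) > 1
    if many_tools and many_calls:
        return "parallel multiple"
    if many_calls:
        return "parallel"
    if many_tools:
        return "multiple"
    return "simple"


def calls_match(expected: list, actual: list) -> bool:
    """Compare two lists of calls, ignoring the order in which they were emitted."""
    def key(call: ToolCall):
        # Serialise arguments with sorted keys so dict ordering does not matter.
        return (call.name, json.dumps(call.arguments, sort_keys=True))

    return sorted(map(key, expected)) == sorted(map(key, actual))


# Example: two weather lookups in one turn, with three tools on offer.
paris = ToolCall("get_weather", {"city": "Paris"})
tokyo = ToolCall("get_weather", {"city": "Tokyo"})
print(classify_case(3, [paris, tokyo]))
print(calls_match([paris, tokyo], [tokyo, paris]))
```

A real harness would also need to tolerate equivalent argument values (for example `"Paris"` versus `"Paris, France"`), which this sketch deliberately leaves out.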
BFCL v3 Leaderboard (May 2026, Open Weights)
| Model | Parameters | BFCL Overall Score | License |
|---|---|---|---|
| Qwen3-235B-A22B (Tool Mode) | ~235B MoE | ~88.5 | Apache 2.0 |
| Llama 4 Maverick | ~400B MoE | ~87.4 | Llama 4 Community |
| Qwen3-32B (Tool Mode) | ~32B | ~85.7 | Apache 2.0 |
| Watt-Tool-70B | ~70B | ~85.0 | Apache 2.0 |
| DeepSeek V4 | ~671B MoE | ~84.6 | MIT |
| Command R+ 104B | ~104B | ~84.2 | CC-BY-NC + Commercial |
| Mistral Large 3 | varies | ~83.5 | Mistral Research |
| Llama 3.1 70B Instruct | ~70B | ~83.1 | Llama 3.1 Community |
| xLAM 70B | ~70B | ~82.7 | CC-BY-NC + Commercial |
| Granite 3.3 8B | ~8B | ~82.1 | Apache 2.0 |
| Hermes-4-70B (Llama base) | ~70B | ~81.7 | Llama 3.x Community |
| xLAM-7B-r | ~7B | ~78.0 | CC-BY-NC + Commercial |
| Watt-Tool-8B | ~8B | ~74.2 | Apache 2.0 |
| NexusRaven-2 (13B) | ~13B | ~71.8 | Apache 2.0 |
Closed-Model Reference
| Model | BFCL Overall |
|---|---|
| Claude 4.7 Opus | ~92.0 |
| GPT-5.5 | ~91.0 |
| Gemini 3.1 Pro | ~89.5 |
| Grok 4 | ~83.0 |
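The open-versus-closed gap can be recomputed directly from the two tables. The sketch below copies the approximate scores as listed on this page and subtracts each open-weight score from the best closed score; `gap_to_best_closed` and `within` are hypothetical helper names, and the numbers inherit whatever uncertainty the "~" values carry.

```python
# Approximate BFCL overall scores (percent), copied from the tables on this page.
OPEN_WEIGHT = {
    "Qwen3-235B-A22B (Tool Mode)": 88.5,
    "Llama 4 Maverick": 87.4,
    "Qwen3-32B (Tool Mode)": 85.7,
    "Watt-Tool-70B": 85.0,
    "DeepSeek V4": 84.6,
    "Command R+ 104B": 84.2,
    "Mistral Large 3": 83.5,
    "Llama 3.1 70B Instruct": 83.1,
    "xLAM 70B": 82.7,
    "Granite 3.3 8B": 82.1,
    "Hermes-4-70B (Llama base)": 81.7,
    "xLAM-7B-r": 78.0,
    "Watt-Tool-8B": 74.2,
    "NexusRaven-2 (13B)": 71.8,
}
CLOSED = {
    "Claude 4.7 Opus": 92.0,
    "GPT-5.5": 91.0,
    "Gemini 3.1 Pro": 89.5,
    "Grok 4": 83.0,
}


def gap_to_best_closed(open_scores: dict, closed_scores: dict) -> dict:
    """Return points behind the best closed model for each open-weight model."""
    best_closed = max(closed_scores.values())
    return {name: round(best_closed - score, 1) for name, score in open_scores.items()}


def within(gaps: dict, max_gap: float) -> list:
    """List the models whose gap is at most max_gap points, smallest gap first."""
    close = [(gap, name) for name, gap in gaps.items() if gap <= max_gap]
    return [name for gap, name in sorted(close)]


gaps = gap_to_best_closed(OPEN_WEIGHT, CLOSED)
for name in within(gaps, 10.0):
    print(f"{name}: {gaps[name]:.1f} points behind the best closed model")
```

Measuring against the single best closed score is one choice among several; comparing against the closed-model average would give smaller gaps.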
Use Case Recommendations
| Use Case | Recommended Model |
|---|---|
| General agent (top quality) | Qwen3-235B-A22B (Tool Mode) or Llama 4 Maverick |
| Production agent on single GPU | Qwen3-32B (Tool Mode) or Watt-Tool-70B quantized |
| Enterprise tool-use under Apache 2.0 | Granite 3.3 8B or Qwen3-32B |
| Edge / on-device function calling | xLAM-1B / xLAM-7B or Watt-Tool-8B |
| Cost-sensitive high-volume | Granite 3.3 8B or Qwen3-8B (Tool Mode) |
| Research / reproducible recipe | Watt-Tool family (open Apache recipe) |
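Recommendations like these amount to filtering the leaderboard by constraints and taking the top remaining score. The sketch below shows that idea on a few rows from this page; the `Model` record and `pick_model` helper are hypothetical, parameter counts are the approximate figures from the leaderboard table, and real selection would also weigh latency, context length, and quantization behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    params_b: float   # approximate total parameters, in billions
    score: float      # approximate BFCL overall score, percent
    license: str


# A subset of the leaderboard rows on this page.
CANDIDATES = [
    Model("Qwen3-235B-A22B (Tool Mode)", 235, 88.5, "Apache 2.0"),
    Model("Qwen3-32B (Tool Mode)", 32, 85.7, "Apache 2.0"),
    Model("Watt-Tool-70B", 70, 85.0, "Apache 2.0"),
    Model("Granite 3.3 8B", 8, 82.1, "Apache 2.0"),
    Model("xLAM-7B-r", 7, 78.0, "CC-BY-NC + Commercial"),
    Model("Watt-Tool-8B", 8, 74.2, "Apache 2.0"),
]


def pick_model(max_params_b: float, license_name: Optional[str] = None) -> Optional[Model]:
    """Return the highest-scoring candidate within the size and license limits."""
    eligible = [
        m for m in CANDIDATES
        if m.params_b <= max_params_b
        and (license_name is None or m.license == license_name)
    ]
    # max() with default=None avoids an exception when nothing qualifies.
    return max(eligible, key=lambda m: m.score, default=None)


choice = pick_model(max_params_b=10, license_name="Apache 2.0")
print(choice.name if choice else "no model fits these constraints")
```

With the rows above, a 10B budget under Apache 2.0 selects Granite 3.3 8B, and a 40B budget with no license filter selects Qwen3-32B (Tool Mode), which lines up with the table's enterprise and single-GPU rows.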
Strategic Context
Three patterns shape the 2026 tool-use landscape. First, tool-use quality is now table stakes: every frontier-lab release ships with explicit function-calling training, and the gap between the best general model and the best specialised tool-use model has narrowed materially. Second, the open-weight gap is narrow: top open-weight models are within 5 to 10 points of closed models on BFCL. Third, dedicated tool-use models (xLAM, Watt-Tool, NexusRaven) retain niche value at small sizes, where general models lag.
Brand Visibility Implications
Tool-use model selection is a key procurement decision for agentic deployments. AI assistant queries about "best function calling model", "open-source agent LLM", "BFCL leaderboard", and similar terms drive procurement-research traffic. Brands selling AI agent platforms, function-calling tooling, and MCP server infrastructure face a large AI-mediated discovery surface in this category.
Methodology
Benchmark data compiled from the Berkeley Function-Calling Leaderboard v3, primary model card disclosures, and the Watt-Tool and xLAM published evaluations through 23 May 2026. Updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on function-calling and tool-use queries across ChatGPT, Claude, Gemini, and Perplexity. For AI agent platforms, function-calling tooling vendors, and MCP server brands, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.