Why "AI Agent" Needs A Taxonomy
"AI agent" in 2026 covers products as different as a customer-support FAQ chatbot and an autonomous coding system that runs unsupervised for hours. Without a taxonomy, capability comparisons are nonsensical. This page proposes a five-tier framework, aligned with Lenny Rachitsky's "Not all AI agents are created equal" framing and mapped to measurable capability surrogates.
The Five Tiers
Tier 1: Chatbot With Tools (Reactive)
Single-turn or short multi-turn question-answering with an optional single tool call. No memory beyond the current session. No planning. Examples: most customer-support bots, simple Slack integrations, basic Q&A copilots. (A minimal sketch of this pattern follows the list below.)
- Capability surrogate: BFCL single-tool 95%+, simple RAG accuracy
- Production deployment risk: low
- Median pilot stall rate: ~30%
- Examples: Salesforce Einstein support agents, Intercom Fin (basic mode), Zendesk Answer Bot
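For concreteness, a minimal sketch of the Tier 1 shape, assuming a hypothetical `call_llm` chat call and a single hypothetical `lookup_order_status` tool (neither is any vendor's real API): one model call, at most one tool call, and no memory beyond the in-process session history.

```python
# Tier 1 sketch: reactive chatbot with an optional single tool call.
# call_llm and lookup_order_status are hypothetical stubs, not a specific vendor API.

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a chat-completion call; returns either text or one tool request."""
    raise NotImplementedError("wire up a model provider here")

def lookup_order_status(order_id: str) -> str:
    """Stand-in for the single tool this bot is allowed to use."""
    raise NotImplementedError

def answer(session_history: list[dict], user_message: str) -> str:
    session_history.append({"role": "user", "content": user_message})
    reply = call_llm(session_history)

    # At most one tool call, then a final answer -- no loop, no planning.
    if reply.get("tool") == "lookup_order_status":
        result = lookup_order_status(reply["arguments"]["order_id"])
        session_history.append({"role": "tool", "content": result})
        reply = call_llm(session_history)

    session_history.append({"role": "assistant", "content": reply["content"]})
    return reply["content"]  # memory ends when session_history is discarded
```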
Tier 2: Workflow Automation (Pre-Defined Steps)
A pre-defined deterministic flow with LLM-powered steps inside fixed branches. The flow does not adapt structurally; the LLM only fills slots. Examples: Zapier "AI Actions" within fixed Zaps, Make.com AI scenarios. (Sketched in code after the list below.)
- Capability surrogate: workflow completion rate, not agent benchmarks
- Production deployment risk: low
- Median pilot stall rate: ~25%
- Examples: Zapier Agents (when used in fixed mode), Make AI scenarios, Tines workflows with AI steps
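What separates Tier 2 from Tier 3 is that the control flow is hard-coded and only the step contents are generated. A minimal sketch, assuming hypothetical `fetch_ticket`, `call_llm`, `send_email`, and `escalate` stubs in place of a workflow platform's real steps:

```python
# Tier 2 sketch: a fixed, deterministic pipeline where an LLM fills slots.
# The branch structure never changes at runtime; only step outputs vary.
# All four helpers are hypothetical stubs for the workflow platform's real steps.

def fetch_ticket(ticket_id: str) -> dict:
    raise NotImplementedError("replace with the platform's data step")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a model call")

def send_email(to: str, body: str) -> None:
    raise NotImplementedError

def escalate(ticket: dict) -> None:
    raise NotImplementedError

def run_ticket_workflow(ticket_id: str) -> None:
    ticket = fetch_ticket(ticket_id)                                     # step 1: fixed
    category = call_llm(f"Classify this ticket: {ticket['text']}")       # step 2: LLM fills a slot

    if category.strip().lower() == "refund":                             # step 3: pre-defined branch
        draft = call_llm(f"Draft a refund reply for: {ticket['text']}")  # another slot
        send_email(ticket["customer_email"], draft)                      # step 4: fixed
    else:
        escalate(ticket)                                                 # the flow cannot add new steps
```

However capable the model, this flow can never take a step that is not already drawn in the pipeline, which is why workflow completion rate, not agent benchmarks, is the right surrogate.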
Tier 3: Tool-Orchestrating Agent (Dynamic, Bounded)
Dynamic multi-step workflows in which the LLM decides which tools to call and in what order, within a bounded toolset (typically 5-20 tools). Some memory across steps; no long-horizon planning. Examples: most "agent" products in production today. (See the orchestration-loop sketch after the list below.)
- Capability surrogate: BFCL 5-20 tools, GAIA Level 1
- Production deployment risk: moderate
- Median pilot stall rate: ~55%
- Examples: Claude with MCP tool integrations, OpenAI Custom GPTs with actions, Microsoft Copilot Studio agents, most Cursor and Cline workflows
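At Tier 3 the control flow inverts: the model, not the pipeline author, decides which tool runs next. A sketch of that orchestration loop, with a hypothetical `call_llm` stub and an invented three-tool registry for illustration:

```python
# Tier 3 sketch: the model chooses which tool to call and in what order,
# inside a bounded registry, until it decides it is done.
# call_llm and the registered tools are hypothetical stubs, not a specific SDK.

def call_llm(messages: list[dict], tools: list[str]) -> dict:
    """Returns either {"tool": name, "arguments": {...}} or {"final": text}."""
    raise NotImplementedError

TOOLS = {  # bounded toolset, typically 5-20 entries in real deployments
    "search_web": lambda query: "stub search results",
    "query_crm": lambda account_id: "stub CRM record",
    "create_ticket": lambda summary: "stub ticket id",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                        # short-term memory = the message list
        decision = call_llm(messages, tools=list(TOOLS))
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["arguments"])  # model-chosen tool and order
        messages.append({"role": "tool", "name": decision["tool"], "content": result})
    return "step budget exhausted"                    # no long-horizon planning or recovery
```

The only memory is the message list and the only guardrail is the step budget, which is why failure isolation moves from easy to moderate in the matrix below.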
Tier 4: Autonomous Task Agent (Long-Horizon)
Multi-hour or multi-step tasks with planning, error recovery, self-correction, and meaningful state management. Operates on tasks rather than turns. Examples: Devin, Claude Code in autonomous mode, OpenAI Codex agent, Operator. (See the plan-execute-reflect sketch after the list below.)
- Capability surrogate: SWE-Bench Verified, GAIA L2-L3, TerminalBench
- Production deployment risk: high
- Median pilot stall rate: ~70%
- Examples: Devin, Claude Code (autonomous), OpenAI Codex agent, Operator, Atlas agentic mode, Comet agentic mode
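Tier 4 adds an explicit plan, state that persists across steps and restarts, and re-planning when a step fails. A rough sketch of that plan-execute-reflect shape; `plan_task`, `execute_step`, and `revise_plan` are hypothetical stand-ins, and the shipping products listed above implement this very differently:

```python
# Tier 4 sketch: task-level loop with an explicit plan, persisted state,
# and error recovery. All helper names are hypothetical stand-ins.
import json
from pathlib import Path

def plan_task(task: str) -> list[str]:
    raise NotImplementedError  # LLM decomposes the task into an ordered step plan

def execute_step(step: str, state: dict) -> dict:
    raise NotImplementedError  # tools, code execution, browsing -- returns new artifacts

def revise_plan(task: str, state: dict) -> list[str]:
    raise NotImplementedError  # LLM re-plans from the failure and accumulated state

def run_autonomous_task(task: str, state_file: Path = Path("agent_state.json")) -> dict:
    # Long-lived state survives across steps and restarts, unlike a chat transcript.
    state = json.loads(state_file.read_text()) if state_file.exists() else {"done": [], "artifacts": {}}
    plan = plan_task(task)

    while plan:
        step = plan.pop(0)
        try:
            outcome = execute_step(step, state)
            state["done"].append(step)
            state["artifacts"].update(outcome)
        except Exception:
            plan = revise_plan(task, state)           # error recovery: re-plan, don't fail the turn
        state_file.write_text(json.dumps(state))      # checkpoint after every step

    return state
```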
Tier 5: Multi-Agent System (Coordinated)
Multiple specialised agents coordinated by an orchestrator, each with distinct roles, tools, and contexts. Examples: research labs running paper-discovery, summarisation, and critique pipelines; enterprise multi-agent customer-service stacks. (See the orchestrator sketch after the list below.)
- Capability surrogate: no canonical benchmark; custom multi-agent evals
- Production deployment risk: very high
- Median pilot stall rate: ~78%
- Examples: AutoGen multi-agent setups, CrewAI deployments, custom LangGraph multi-node systems, Anthropic Claude Skills compositions
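Tier 5 adds a coordination layer: an orchestrator decomposes the goal and routes sub-tasks to specialised agents, each with its own tools and context. A simplified sketch with hypothetical stubs; real frameworks such as AutoGen, CrewAI, and LangGraph each define their own abstractions:

```python
# Tier 5 sketch: an orchestrator routes sub-tasks to specialised agents.
# Every function here is a hypothetical stub, not a framework API.

def research_agent(subtask: str) -> str:
    raise NotImplementedError  # its own tools: search, fetch, extract

def summariser_agent(subtask: str) -> str:
    raise NotImplementedError  # its own context and prompt

def critic_agent(subtask: str) -> str:
    raise NotImplementedError  # reviews other agents' output

def decompose(goal: str) -> list[tuple[str, str]]:
    raise NotImplementedError  # planner LLM returns (role, subtask) pairs

SPECIALISTS = {
    "research": research_agent,
    "summarise": summariser_agent,
    "critique": critic_agent,
}

def orchestrate(goal: str) -> dict[str, str]:
    shared_memory: dict[str, str] = {}                 # state shared across agents
    for role, subtask in decompose(goal):              # e.g. ("research", "find recent papers on X")
        shared_memory[f"{role}:{subtask}"] = SPECIALISTS[role](subtask)
    # Failure isolation is the hard part: one bad sub-result propagates through shared_memory.
    return shared_memory
```

The final comment is the operational point: shared state makes failure isolation very hard, because one bad sub-result silently flows into every downstream agent's context.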
Tier Comparison Matrix
| Tier | Memory | Planning | Tool count | Time horizon | Failure isolation |
|---|---|---|---|---|---|
| 1: Chatbot | Session-only | None | 0-1 | Seconds-minutes | Easy |
| 2: Workflow | State-machine | Pre-defined | Fixed pipeline | Minutes | Easy |
| 3: Tool-orchestrating | Short-term | Reactive | 5-20 | Minutes | Moderate |
| 4: Autonomous task | Long-term + episodic | Multi-step | 10-50 | Minutes-hours | Hard |
| 5: Multi-agent | Shared across agents | Hierarchical | Per-agent + shared | Hours-days | Very hard |
Common Mis-Categorisation
Vendor positioning systematically over-categorises. Real-world observations:
- Most "autonomous agents" advertised by SaaS vendors are Tier 3 tool-orchestrating with marketing labels
- Many "multi-agent systems" are sequential pipelines with marketing labels (Tier 2 or 3 in disguise)
- True Tier 4 autonomous agents in 2026 are rare: Devin, Claude Code (autonomous mode), OpenAI Codex agent, Operator, and a handful of others
- True Tier 5 multi-agent systems in production are very rare; most "multi-agent" deployments are research demos or pilots
Buyer Decision Framework
- Buying for Tier 1 task: pick Tier 1 product. Tier 3+ is overkill: more expensive, with more failure modes.
- Buying for Tier 2 task: pick Tier 2 product (workflow automation). Do not buy "agents" for deterministic flows.
- Buying for Tier 3 task: pick mature Tier 3 platform (Claude with MCP, Microsoft Copilot Studio, OpenAI Custom GPTs).
- Buying for Tier 4 task: pick a specific Tier 4 product (Devin for code, Operator for browsing) and accept higher pilot risk.
- Buying for Tier 5: build, do not buy. Tier 5 productisation is immature; commercial multi-agent platforms typically underdeliver.
Brand Visibility Implications
Brand-recommendation behaviour differs by tier. Tier 1 chatbots typically pull brand recommendations from RAG corpora; Tier 3 tool-orchestrating agents call search and database tools that surface brands dynamically; Tier 4 autonomous agents weigh brand recommendations across long task contexts. Brand-visibility programs should map their target buyer journeys to the relevant tier and instrument visibility per tier, not per "agent" generically. See "How AI agents choose brands" for the brand-mechanism analysis.
Methodology
Tier framework adapted from Lenny Rachitsky's newsletter, mapped to publicly measurable benchmarks (BFCL, SWE-Bench, GAIA, TerminalBench). Pilot stall rates from BCG, McKinsey, and Presenc AI deployment instrumentation. Vendor-product tier assignments are subjective judgements based on observed product behaviour; vendor disagreement is expected. Updated quarterly.
How Presenc AI Helps
Presenc AI's instrumentation differentiates brand-recommendation behaviour by agent tier, surfacing which agent capability levels actually drive brand exposure for buyers. For brand teams choosing where to invest agent-visibility effort, this is the operational signal of where buyers actually engage agents versus where the agent surface is small or pilot-only.