Multi-Agent Architecture: When One AI Isn't Enough
What is multi-agent architecture?
Multi-agent architecture is a system design pattern where a complex AI workflow is broken into multiple specialized agents, each responsible for one task — such as classification, generation, or validation — coordinated by an orchestrator. Agents communicate through structured outputs and can run sequentially or in parallel, enabling better quality, cost control, and scalability than a single monolithic LLM prompt.
TL;DR
- Rule of thumb: if a task has more than 2 steps with different model or prompt requirements, split into agents
- Three orchestration patterns: Sequential Pipeline (simple ETL), Parallel Fan-Out (independent subtasks), Classifier + Router (dynamic routing)
- Classifier + Worker uses a cheap model (e.g. Haiku) for routing, reserving expensive models only for the actual work
- Structured output (Pydantic schemas) at every agent boundary is mandatory — without it, multi-agent systems are black boxes
- Distributed tracing with Langfuse or LangSmith is required for debugging; every call, prompt, and result must be observable
A single LLM call handles a single task. But real product workflows rarely fit in one call. Analyzing a document, generating a plan, validating quality, formatting the output — those are four separate tasks with different model requirements, context needs, and prompts. Cramming everything into one prompt degrades quality at every step.
Multi-agent architecture breaks a complex workflow into specialized agents. Each agent does one thing well: one analyzes, one generates, one validates. Agents communicate through a protocol, pass results to each other, and can run in parallel.
This article covers how to design a multi-agent system for a startup: which orchestration patterns to choose, how to divide responsibility between agents, and when a single agent is actually better than three.
One Agent vs. Many: Where to Draw the Line
A single agent works fine when the task is linear. Q&A, summarization, classification — one prompt, one call, one result. Problems start when the task contains conflicting requirements.
Example: an AI assistant for a travel app. The user types “plan a trip to Japan for 10 days.” A single agent would need to:
- Understand the intent and extract parameters (NLU)
- Find attractions and assess logistics (search + reasoning)
- Build an itinerary accounting for distances and time (optimization)
- Generate descriptions for each day (text generation)
- Validate the result for errors (validation)
Each step requires a different prompt, different temperature, sometimes a different model. NLU works best with temperature: 0. Text generation — with temperature: 0.7. Route optimization might need a reasoning model like o3. Day descriptions work great with the cheap and fast Gemini Flash.
Stuffing everything into one prompt gives the model contradictory instructions. The result: a bloated 3,000+ token prompt, unpredictable behavior, and no way to debug a specific step.
Rule of thumb: if a task has more than two steps with different model or prompt requirements — split into agents.
Three Orchestration Patterns for Multi-Agent Systems
Sequential Pipeline
The simplest pattern. Agents are chained together — the output of one becomes the input of the next.
Input → [Agent A: Extract] → [Agent B: Process] → [Agent C: Format] → Output
The implementation is minimal:
async def pipeline(user_input: str) -> str:
    # Step 1: Extract parameters
    params = await agent_extract.run(user_input)

    # Step 2: Generate plan
    plan = await agent_plan.run(params)

    # Step 3: Validate and format
    result = await agent_format.run(plan)
    return result
Pros: easy to debug (each step can be tested independently), predictable execution order, easy to add intermediate logging.
Cons: total time = sum of all steps. If Agent B fails, the whole pipeline fails. No parallelism.
When to use: ETL-like tasks, document processing, any workflow with a fixed step order.
Parallel Fan-Out / Fan-In
Multiple agents work simultaneously on different aspects of the same task. Results are collected and merged.
              ┌─ [Agent A: Safety] ──┐
Input ────────┼─ [Agent B: Quality] ─┼──► Merge → Output
              └─ [Agent C: Style] ───┘
async def parallel_review(code: str) -> dict:
    # Run three agents in parallel
    safety, quality, style = await asyncio.gather(
        agent_safety.run(code),
        agent_quality.run(code),
        agent_style.run(code),
    )
    # Merge results
    return merge_reviews(safety, quality, style)
This pattern is used in Claude Concilium for parallel consultations with multiple LLMs. Three models independently analyze the same code; results are compared to find consensus.
Pros: total time = time of the slowest agent (not the sum). Independence: one agent failing doesn’t block the others.
Cons: requires result-merging logic. Agents don’t see each other’s outputs.
When to use: code review, multi-aspect analysis, A/B prompt testing, validation from multiple perspectives.
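The `merge_reviews` step above is deliberately left abstract — the merging logic depends on what your agents return. A minimal sketch, assuming each agent returns a hypothetical `{"issues": [...], "score": float}` shape:

```python
def merge_reviews(safety: dict, quality: dict, style: dict) -> dict:
    """Merge independent review results into a single report.

    Assumes each agent returns {"issues": [...], "score": float} --
    an illustrative contract, adjust to your own schemas.
    """
    issues = safety["issues"] + quality["issues"] + style["issues"]
    # Overall score: the weakest aspect dominates
    score = min(safety["score"], quality["score"], style["score"])
    return {"issues": issues, "score": score, "passed": score >= 0.7 and not issues}

report = merge_reviews(
    {"issues": [], "score": 0.9},
    {"issues": ["unused variable"], "score": 0.8},
    {"issues": [], "score": 0.95},
)
```

Taking the minimum score rather than the average is a design choice: a plan that fails on safety should fail overall, no matter how good the style is.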
Router: Dynamic Agent Routing
A router agent analyzes the request and dispatches it to a specialized agent. Think of it as an API gateway in microservices.
                  ┌─ [Agent: Travel Planning]
Input → [Router] ─┼─ [Agent: Booking]
                  └─ [Agent: General Chat]
ROUTES = {
    "travel_plan": agent_travel,
    "booking": agent_booking,
    "general": agent_chat,
}

async def router(user_input: str) -> str:
    # Classify intent (cheap model)
    intent = await classifier.run(
        f"Classify intent: {user_input}",
        model="gemini-2.0-flash",
        temperature=0,
    )
    # Route to specialized agent
    agent = ROUTES.get(intent, agent_chat)
    return await agent.run(user_input)
The router uses a cheap model for classification. Specialized agents can use expensive models with larger context windows. Savings: 80%+ of requests are handled by cheaper agents.
For more on model routing by task type, see the article on multi-provider LLM architecture.
When to use: chatbots with multiple domains, heterogeneous document processing, systems with clearly defined specialization.
Agent Specialization: Who Does What
A typical breakdown for a product startup:
| Agent | Task | Model | Temperature | Prompt size |
|---|---|---|---|---|
| Classifier | Intent detection / routing | Gemini Flash | 0 | ~200 tokens |
| Extractor | Data extraction from text | DeepSeek Chat | 0 | ~500 tokens |
| Planner | Plan and itinerary generation | Claude Sonnet / o3 | 0.3 | ~1500 tokens |
| Writer | User-facing text generation | Gemini Flash | 0.7 | ~800 tokens |
| Validator | JSON schema and fact checking | GPT-4o | 0 | ~400 tokens |
| Judge | Output quality scoring | Claude Sonnet | 0.2 | ~600 tokens |
The Judge agent deserves special attention. It doesn’t generate content — it evaluates other agents’ output against criteria: completeness, relevance, safety. More on this pattern in the LLM-as-Judge article.
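A minimal sketch of the Judge's output handling, assuming a hypothetical contract where the Judge returns JSON scores per criterion (names and threshold are illustrative):

```python
import json

def judge_passes(raw_judge_output: str, threshold: float = 0.7) -> bool:
    """Apply a per-criterion pass threshold to the Judge's JSON scores.

    Assumes the Judge was prompted to return:
    {"completeness": 0-1, "relevance": 0-1, "safety": 0-1}
    """
    scores = json.loads(raw_judge_output)
    return all(
        scores[criterion] >= threshold
        for criterion in ("completeness", "relevance", "safety")
    )

ok = judge_passes('{"completeness": 0.9, "relevance": 0.8, "safety": 1.0}')
bad = judge_passes('{"completeness": 0.9, "relevance": 0.4, "safety": 1.0}')
```

Checking every criterion against the threshold (rather than averaging) keeps a single weak dimension from being masked by strong ones.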
Each agent is a combination of three things: a system prompt, a model choice, and a set of tools/functions. Example configuration:
from dataclasses import dataclass

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 4096
    tools: list[str] | None = None
    timeout_seconds: int = 30

AGENTS = {
    "classifier": AgentConfig(
        name="classifier",
        system_prompt="You are an intent classifier. Return one of: travel_plan, booking, general.",
        model="google/gemini-2.0-flash",
        temperature=0,
        max_tokens=50,
        timeout_seconds=5,
    ),
    "planner": AgentConfig(
        name="planner",
        system_prompt="You are a travel planner. Create detailed day-by-day itineraries.",
        model="anthropic/claude-sonnet-4",
        temperature=0.3,
        max_tokens=8192,
        tools=["search_places", "get_distances", "check_opening_hours"],
        timeout_seconds=60,
    ),
    "validator": AgentConfig(
        name="validator",
        system_prompt="Validate the travel plan. Check: dates are consistent, distances are realistic, no duplicate places.",
        model="openai/gpt-4o",
        temperature=0,
        max_tokens=2048,
        timeout_seconds=15,
    ),
}
Communication Protocol: How Agents Exchange Data
Agents need to exchange data in a predictable format. Two approaches.
Approach 1: Structured Output (JSON)
Each agent returns a JSON object matching a predefined schema. The next agent receives that JSON as part of its prompt.
from pydantic import BaseModel

class TravelParams(BaseModel):
    destination: str
    duration_days: int
    interests: list[str]
    budget_level: str  # "budget" | "mid" | "luxury"

class DayPlan(BaseModel):
    day: int
    activities: list[str]
    estimated_cost_usd: float

class TravelPlan(BaseModel):
    params: TravelParams
    days: list[DayPlan]
    total_cost_usd: float

# Agent A → structured output
params: TravelParams = await agent_extract.run(
    user_input,
    response_format=TravelParams,
)

# Agent B accepts structured input
plan: TravelPlan = await agent_plan.run(
    f"Create a travel plan based on: {params.model_dump_json()}",
    response_format=TravelPlan,
)
Pros: type safety, validation at every step, easy to log and debug.
Cons: structured output overhead (not all models support it equally well), schema rigidity.
Approach 2: MCP (Model Context Protocol)
Agents interact through a standardized protocol. Each agent is an MCP server that exposes tools. The orchestrator calls the tools of the appropriate agent.
Orchestrator (Claude Code)
│
├── MCP: agent-planner
│     └── tool: create_plan(destination, days, interests)
│
├── MCP: agent-validator
│     └── tool: validate_plan(plan_json)
│
└── MCP: agent-writer
      └── tool: write_description(day_plan)
MCP provides a standard interface for connecting agents. An agent server can be written in any language and run locally or remotely. More on building production MCP servers in the custom MCP servers article.
In practice, MCP is better suited for dev-time agents (IDE assistants), while structured output is better for runtime agents (handling user requests in production).
The Orchestrator: Central Component of the System
The orchestrator decides: which agent to call, in what order, and what to do with errors. A minimal implementation:
import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)

class Orchestrator:
    def __init__(self, agents: dict[str, AgentConfig]):
        self.agents = agents

    async def run_agent(self, name: str, input_data: str) -> Any:
        """Run a single agent with timeout and retry."""
        config = self.agents[name]
        for attempt in range(3):
            try:
                # call_llm: a provider-agnostic async wrapper (not shown here)
                result = await asyncio.wait_for(
                    call_llm(
                        model=config.model,
                        system=config.system_prompt,
                        user=input_data,
                        temperature=config.temperature,
                        max_tokens=config.max_tokens,
                        tools=config.tools,
                    ),
                    timeout=config.timeout_seconds,
                )
                logger.info(f"Agent {name} completed", extra={
                    "agent": name,
                    "attempt": attempt + 1,
                    "model": config.model,
                })
                return result
            except asyncio.TimeoutError:
                logger.warning(f"Agent {name} timeout, attempt {attempt + 1}")
                continue
            except Exception as e:
                logger.error(f"Agent {name} failed: {e}")
                if attempt == 2:
                    raise
        raise RuntimeError(f"Agent {name} failed after 3 attempts")

    async def run_pipeline(self, user_input: str) -> str:
        """Sequential pipeline: classify → plan → validate → format."""
        # Step 1: Classification
        intent = await self.run_agent("classifier", user_input)
        if intent == "general":
            return await self.run_agent("writer", user_input)

        # Step 2: Extract parameters
        params = await self.run_agent("extractor", user_input)

        # Step 3: Generate plan
        plan = await self.run_agent("planner", params)

        # Step 4: Parallel validation
        validation, quality_score = await asyncio.gather(
            self.run_agent("validator", plan),
            self.run_agent("judge", plan),
        )

        # Step 5: If validation failed — retry with feedback
        if not validation.get("is_valid"):
            plan = await self.run_agent(
                "planner",
                f"Previous plan had issues: {validation['issues']}. "
                f"Original request: {params}. Fix the plan."
            )
        return plan
Key design decisions in the orchestrator:
- Retry with context. On retry, the agent receives information about what went wrong. Not a blind retry, but a retry with validator feedback.
- Parallelize where possible. Validator and Judge run concurrently — saving 30–50% of total time.
- Graceful degradation. If Judge is unavailable, the pipeline continues without quality scoring. A response without a score beats no response at all.
Observability: Seeing What’s Happening Inside
A multi-agent system without observability is a black box. Three essential components.
1. Tracing
Every user request is a trace. Every agent call is a span within that trace. Langfuse is a great fit here:
from langfuse import Langfuse

langfuse = Langfuse()

async def run_with_tracing(user_input: str) -> str:
    trace = langfuse.trace(
        name="travel-planning",
        input=user_input,
    )

    # Classifier span
    classifier_span = trace.span(name="classifier")
    intent = await run_agent("classifier", user_input)
    classifier_span.end(output=intent)

    # Planner span
    planner_span = trace.span(name="planner")
    plan = await run_agent("planner", user_input)
    planner_span.end(output=plan)

    # ... other agents

    trace.update(output=plan)
    return plan
In Langfuse you can see: which agents were called, how long each took, how many tokens each consumed, and what each returned. More on setup in the LLM Observability with Langfuse article.
2. Metrics
A minimal set of metrics for a multi-agent system:
| Metric | What it shows | Alert threshold |
|---|---|---|
| agent.latency_p95 | Speed per agent | > 10s for classifier |
| agent.error_rate | Error rate per agent | > 5% |
| agent.token_cost | Cost per agent | > $X/day budget |
| pipeline.success_rate | Successful pipeline runs | < 95% |
| pipeline.retry_rate | How often retries are needed | > 20% |
| judge.quality_score | Average quality score | < 0.7 |
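Turning the table above into an automated check is straightforward. A sketch, with threshold values copied from the table (the direction markers and function names are illustrative):

```python
# Alert thresholds from the metrics table; "gt"/"lt" = fire when
# the value is greater/less than the limit
THRESHOLDS = {
    "agent.error_rate": ("gt", 0.05),
    "pipeline.success_rate": ("lt", 0.95),
    "pipeline.retry_rate": ("gt", 0.20),
    "judge.quality_score": ("lt", 0.7),
}

def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that crossed their alert threshold."""
    fired = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported this window
        if (direction == "gt" and value > limit) or (direction == "lt" and value < limit):
            fired.append(name)
    return fired

alerts = check_alerts({"agent.error_rate": 0.08, "pipeline.success_rate": 0.97})
# → ["agent.error_rate"]
```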
3. Decision Logging
Every orchestrator decision gets logged: why this agent was chosen, why a retry happened, why the pipeline took an alternate path. Without this, debugging is pure guesswork.
logger.info("routing_decision", extra={
    "trace_id": trace.id,
    "intent": intent,
    "selected_agent": "planner",
    "reason": "intent=travel_plan, confidence=0.95",
    "fallback_agent": "writer",
})
Cost: Multi-Agent System vs. One Big Prompt
More agents = more LLM calls = more cost. But a well-designed architecture can actually be cheaper than a single large prompt.
Cost comparison for “plan a trip”:
| Approach | Calls | Models | Tokens (input) | Tokens (output) | Estimated cost |
|---|---|---|---|---|---|
| Single agent | 1 | Claude Sonnet | ~3000 | ~2000 | ~$0.025 |
| Pipeline (5 agents) | 5 | Mix | ~2500 total | ~1800 total | ~$0.008 |
The pipeline is cheaper because:
- Classifier uses Flash (~$0.0001 per call)
- Extractor uses DeepSeek (~$0.14 per 1M input tokens)
- Only the Planner uses an expensive model, and its prompt is shorter (parameters are already extracted)
Result: 60–70% savings on LLM costs with proper task splitting and model routing.
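The arithmetic behind this kind of comparison is worth making explicit. A sketch with illustrative per-token prices — these are assumptions, not current provider pricing, so the absolute numbers will differ from the table above; the point is the structure of the calculation:

```python
# Illustrative price table (USD per 1M tokens, input/output) -- assumptions,
# check your provider's current pricing
PRICES = {
    "gemini-flash": (0.10, 0.40),
    "deepseek-chat": (0.14, 0.28),
    "claude-sonnet": (3.00, 15.00),
}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one LLM call in USD."""
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# One big prompt on an expensive model vs. a routed pipeline
single = call_cost("claude-sonnet", 3000, 2000)
pipeline = (
    call_cost("gemini-flash", 200, 20)        # classifier
    + call_cost("deepseek-chat", 500, 200)    # extractor
    + call_cost("claude-sonnet", 1200, 1200)  # planner (shorter prompt)
    + call_cost("gemini-flash", 600, 400)     # writer
)
```

The savings come almost entirely from shrinking the expensive model's prompt and pushing cheap steps onto cheap models.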
Common Mistakes in Multi-Agent System Design
Over-engineering. Three agents for a task that a single prompt with structured output could handle. If the task is linear and doesn’t require different models — one agent is simpler and more reliable.
No fallback. The planner agent goes down — the entire pipeline is dead. Every critical agent needs a fallback: an alternative model, a simplified prompt, or a cache of previous results.
Chains that are too long. Each agent adds latency and a potential failure point. An 8-agent pipeline with P95 latency of 3 seconds per agent means 24 seconds per response. Users won’t wait.
Ignoring the context window. Passing the full output of one agent into the next agent’s prompt. A 5,000-token planner output plus a 500-token validator system prompt — unnecessary cost. Pass only what the current agent actually needs.
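The fix is to filter before forwarding. A stdlib sketch (field names are hypothetical; with the Pydantic models shown earlier, `model_dump_json(exclude=...)` does the same job):

```python
import json

# Hypothetical planner output: structure plus a large free-text field
planner_output = {
    "destination": "Japan",
    "days": [{"day": 1, "activities": ["Senso-ji"], "estimated_cost_usd": 120.0}],
    "prose": "...thousands of tokens of day descriptions...",
}

# The validator checks structure and numbers, not the prose --
# forward only the fields it actually needs
VALIDATOR_FIELDS = {"destination", "days"}
validator_input = json.dumps(
    {k: v for k, v in planner_output.items() if k in VALIDATOR_FIELDS}
)
```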
No circuit breaker. One agent returns malformed JSON. The next agent gets broken input, fails, the orchestrator retries — three attempts, three failures, wasted tokens. Circuit breaker: if an agent returns invalid output, don’t pass it downstream — surface the error immediately.
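A minimal circuit-breaker sketch: validate output at the boundary and fail fast. In the real pipeline this would be the Pydantic schemas shown earlier; a stdlib version keeps the idea self-contained (field names are illustrative):

```python
import json

REQUIRED_FIELDS = {"destination", "days", "total_cost_usd"}

def guard_output(raw: str) -> dict:
    """Circuit breaker: validate agent output before it goes downstream.

    Malformed or incomplete output is surfaced immediately instead of
    being passed to the next agent and wasting retries there.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"Agent returned malformed JSON: {e}") from e
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise RuntimeError(f"Agent output missing fields: {sorted(missing)}")
    return data

plan = guard_output('{"destination": "Japan", "days": [], "total_cost_usd": 1200.5}')
```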
Where to Start
- Identify the bottleneck. Take your current single-agent pipeline. Find the step that breaks most often or produces poor quality. Extract it into a separate agent.
- Start with two agents. Classifier + Worker. The Classifier identifies the task type on a cheap model. The Worker handles it on the right model. This already gives you routing and cost savings.
- Add structured output. Pydantic schemas for inputs and outputs of each agent. Validation at every step. Without this, debugging a multi-agent system is nearly impossible.
- Connect tracing. Langfuse, LangSmith, or equivalent. See every call, every prompt, every result. Without observability, a multi-agent system is a black box.
- Add a Judge agent. Automatic quality scoring at the end of the pipeline. If the score falls below threshold — retry with feedback. This closes the quality loop.
- Optimize from metrics. Watch latency, cost, and quality score per agent. Swap models, tweak prompts, and reshape architecture based on real data.
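Steps 5 and 6 close the quality loop. A sketch of that loop, assuming the orchestrator's `run_agent` signature and a Judge that returns a hypothetical `{"score": float, "feedback": str}` shape:

```python
import asyncio

async def plan_with_quality_loop(run_agent, request: str,
                                 threshold: float = 0.7,
                                 max_rounds: int = 2) -> str:
    """Generate, score with the Judge, retry with feedback when below threshold.

    run_agent is the orchestrator's runner (hypothetical signature); the Judge
    is assumed to return {"score": float, "feedback": str}.
    """
    plan = await run_agent("planner", request)
    for _ in range(max_rounds):
        review = await run_agent("judge", plan)
        if review["score"] >= threshold:
            break
        # Retry with feedback, not a blind retry
        plan = await run_agent(
            "planner",
            f"Improve this plan. Judge feedback: {review['feedback']}\n"
            f"Original request: {request}",
        )
    return plan
```

`max_rounds` caps the loop so a stubbornly low score cannot burn tokens forever; after the cap, the best available plan ships.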
Frequently Asked Questions
When should I use multi-agent architecture instead of a single LLM call?
Use multi-agent when the task contains conflicting requirements — e.g., analysis needs a large context while generation needs a focused one. Also when different steps benefit from different models (cheap for classification, powerful for generation) or when you need parallel execution.
What is the simplest multi-agent pattern to start with?
Classifier + Worker. The Classifier identifies the task type on a cheap model (like Haiku), then routes to the appropriate Worker with the right model and prompt. This gives you task routing and cost savings with minimal complexity.
How do you debug a multi-agent system?
Structured output (Pydantic schemas) at every agent boundary plus distributed tracing (Langfuse or LangSmith). Without both, a multi-agent system is a black box. Every call, prompt, and result must be observable.