Multi-Agent Architecture: When One AI Isn't Enough
What is multi-agent architecture?
Multi-agent architecture is a system design pattern where a complex AI workflow is broken into multiple specialized agents, each responsible for one task — such as classification, generation, or validation — coordinated by an orchestrator. Agents communicate through structured outputs and can run sequentially or in parallel, enabling better quality, cost control, and scalability than a single monolithic LLM prompt.
TL;DR
- Rule of thumb: if a task has more than 2 steps with different model or prompt requirements, split into agents
- Three orchestration patterns: Sequential Pipeline (simple ETL), Parallel Fan-Out (independent subtasks), Classifier + Router (dynamic routing)
- Classifier + Worker uses a cheap model (e.g. Haiku) for routing, reserving expensive models only for the actual work
- Structured output (Pydantic schemas) at every agent boundary is mandatory — without it, multi-agent systems are black boxes
- Distributed tracing with Langfuse or LangSmith is required for debugging; every call, prompt, and result must be observable
A single LLM call handles a single task. But real product workflows rarely fit in one call. Analyzing a document, generating a plan, validating quality, formatting the output — those are four separate tasks with different model requirements, context needs, and prompts. Cramming everything into one prompt degrades quality at every step.
Multi-agent architecture breaks a complex workflow into specialized agents. Each agent does one thing well: one analyzes, one generates, one validates. Agents communicate through a protocol, pass results to each other, and can run in parallel.
This article covers how to design a multi-agent system for a startup: which orchestration patterns to choose, how to divide responsibility between agents, and when a single agent is actually better than three.
One Agent vs. Many: Where to Draw the Line
A single agent works fine when the task is linear. Q&A, summarization, classification — one prompt, one call, one result. Problems start when the task contains conflicting requirements.
Example: an AI assistant for a travel app. The user types “plan a trip to Japan for 10 days.” A single agent would need to:
- Understand the intent and extract parameters (NLU)
- Find attractions and assess logistics (search + reasoning)
- Build an itinerary accounting for distances and time (optimization)
- Generate descriptions for each day (text generation)
- Validate the result for errors (validation)
Each step requires a different prompt, different temperature, sometimes a different model. NLU works best with temperature: 0. Text generation — with temperature: 0.7. Route optimization might need a reasoning model like o3. Day descriptions work great with the cheap and fast Gemini Flash.
Stuffing everything into one prompt gives the model contradictory instructions. The result: a bloated 3,000+ token prompt, unpredictable behavior, and no way to debug a specific step.
Rule of thumb: if a task has more than two steps with different model or prompt requirements — split into agents.
Three Orchestration Patterns for Multi-Agent Systems
Sequential Pipeline
The simplest pattern. Agents are chained together — the output of one becomes the input of the next.
Input → [Agent A: Extract] → [Agent B: Process] → [Agent C: Format] → Output
The implementation is minimal:
async def pipeline(user_input: str) -> str:
    # Step 1: Extract parameters
    params = await agent_extract.run(user_input)

    # Step 2: Generate plan
    plan = await agent_plan.run(params)

    # Step 3: Validate and format
    result = await agent_format.run(plan)
    return result
Pros: easy to debug (each step can be tested independently), predictable execution order, easy to add intermediate logging.
Cons: total time = sum of all steps. If Agent B fails, the whole pipeline fails. No parallelism.
When to use: ETL-like tasks, document processing, any workflow with a fixed step order.
Parallel Fan-Out / Fan-In
Multiple agents work simultaneously on different aspects of the same task. Results are collected and merged.
              ┌─ [Agent A: Safety] ──┐
Input ────────┼─ [Agent B: Quality] ─┼──► Merge → Output
              └─ [Agent C: Style] ───┘
async def parallel_review(code: str) -> dict:
    # Run three agents in parallel
    safety, quality, style = await asyncio.gather(
        agent_safety.run(code),
        agent_quality.run(code),
        agent_style.run(code),
    )
    # Merge results
    return merge_reviews(safety, quality, style)
This pattern is used in Claude Concilium for parallel consultations with multiple LLMs. Three models independently analyze the same code; results are compared to find consensus.
Pros: total time = time of the slowest agent (not the sum). Independence: one agent failing doesn’t block the others.
Cons: requires result-merging logic. Agents don’t see each other’s outputs.
When to use: code review, multi-aspect analysis, A/B prompt testing, validation from multiple perspectives.
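The `merge_reviews` step above is deliberately left abstract — the merging logic depends on what your agents return. A minimal sketch, assuming each agent returns a hypothetical `{"issues": [...], "score": float}` shape:

```python
def merge_reviews(safety: dict, quality: dict, style: dict) -> dict:
    """Merge independent review results into a single report.

    Assumes each agent returns {"issues": [...], "score": float} --
    an illustrative contract, adjust to your own schemas.
    """
    issues = safety["issues"] + quality["issues"] + style["issues"]
    # Overall score: the weakest aspect dominates
    score = min(safety["score"], quality["score"], style["score"])
    return {"issues": issues, "score": score, "passed": score >= 0.7 and not issues}

report = merge_reviews(
    {"issues": [], "score": 0.9},
    {"issues": ["unused variable"], "score": 0.8},
    {"issues": [], "score": 0.95},
)
```

Taking the minimum score rather than the average is a design choice: a plan that fails on safety should fail overall, no matter how good the style is.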
Router: Dynamic Agent Routing
A router agent analyzes the request and dispatches it to a specialized agent. Think of it as an API gateway in microservices.
                  ┌─ [Agent: Travel Planning]
Input → [Router] ─┼─ [Agent: Booking]
                  └─ [Agent: General Chat]
ROUTES = {
    "travel_plan": agent_travel,
    "booking": agent_booking,
    "general": agent_chat,
}

async def router(user_input: str) -> str:
    # Classify intent (cheap model)
    intent = await classifier.run(
        f"Classify intent: {user_input}",
        model="gemini-2.0-flash",
        temperature=0,
    )
    # Route to specialized agent
    agent = ROUTES.get(intent, agent_chat)
    return await agent.run(user_input)
The router uses a cheap model for classification. Specialized agents can use expensive models with larger context windows. Savings: 80%+ of requests are handled by cheaper agents.
For more on model routing by task type, see the article on multi-provider LLM architecture.
When to use: chatbots with multiple domains, heterogeneous document processing, systems with clearly defined specialization.
Agent Specialization: Who Does What
A typical breakdown for a product startup:
| Agent | Task | Model | Temperature | Prompt size |
|---|---|---|---|---|
| Classifier | Intent detection / routing | Gemini Flash | 0 | ~200 tokens |
| Extractor | Data extraction from text | DeepSeek Chat | 0 | ~500 tokens |
| Planner | Plan and itinerary generation | Claude Sonnet / o3 | 0.3 | ~1500 tokens |
| Writer | User-facing text generation | Gemini Flash | 0.7 | ~800 tokens |
| Validator | JSON schema and fact checking | GPT-4o | 0 | ~400 tokens |
| Judge | Output quality scoring | Claude Sonnet | 0.2 | ~600 tokens |
The Judge agent deserves special attention. It doesn’t generate content — it evaluates other agents’ output against criteria: completeness, relevance, safety. More on this pattern in the LLM-as-Judge article.
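A minimal sketch of the Judge's output handling, assuming a hypothetical contract where the Judge returns JSON scores per criterion (names and threshold are illustrative):

```python
import json

def judge_passes(raw_judge_output: str, threshold: float = 0.7) -> bool:
    """Apply a per-criterion pass threshold to the Judge's JSON scores.

    Assumes the Judge was prompted to return:
    {"completeness": 0-1, "relevance": 0-1, "safety": 0-1}
    """
    scores = json.loads(raw_judge_output)
    return all(
        scores[criterion] >= threshold
        for criterion in ("completeness", "relevance", "safety")
    )

ok = judge_passes('{"completeness": 0.9, "relevance": 0.8, "safety": 1.0}')
bad = judge_passes('{"completeness": 0.9, "relevance": 0.4, "safety": 1.0}')
```

Checking every criterion against the threshold (rather than averaging) keeps a single weak dimension from being masked by strong ones.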
Each agent is a combination of three things: a system prompt, a model choice, and a set of tools/functions. Example configuration:
from dataclasses import dataclass

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 4096
    tools: list[str] | None = None
    timeout_seconds: int = 30

AGENTS = {
    "classifier": AgentConfig(
        name="classifier",
        system_prompt="You are an intent classifier. Return one of: travel_plan, booking, general.",
        model="google/gemini-2.0-flash",
        temperature=0,
        max_tokens=50,
        timeout_seconds=5,
    ),
    "planner": AgentConfig(
        name="planner",
        system_prompt="You are a travel planner. Create detailed day-by-day itineraries.",
        model="anthropic/claude-sonnet-4",
        temperature=0.3,
        max_tokens=8192,
        tools=["search_places", "get_distances", "check_opening_hours"],
        timeout_seconds=60,
    ),
    "validator": AgentConfig(
        name="validator",
        system_prompt="Validate the travel plan. Check: dates are consistent, distances are realistic, no duplicate places.",
        model="openai/gpt-4o",
        temperature=0,
        max_tokens=2048,
        timeout_seconds=15,
    ),
}
Communication Protocol: How Agents Exchange Data
Agents need to exchange data in a predictable format. Two approaches.
Approach 1: Structured Output (JSON)
Each agent returns a JSON object matching a predefined schema. The next agent receives that JSON as part of its prompt.
from pydantic import BaseModel

class TravelParams(BaseModel):
    destination: str
    duration_days: int
    interests: list[str]
    budget_level: str  # "budget" | "mid" | "luxury"

class DayPlan(BaseModel):
    day: int
    activities: list[str]
    estimated_cost_usd: float

class TravelPlan(BaseModel):
    params: TravelParams
    days: list[DayPlan]
    total_cost_usd: float

# Agent A → structured output
params: TravelParams = await agent_extract.run(
    user_input,
    response_format=TravelParams,
)

# Agent B accepts structured input
plan: TravelPlan = await agent_plan.run(
    f"Create a travel plan based on: {params.model_dump_json()}",
    response_format=TravelPlan,
)
Pros: type safety, validation at every step, easy to log and debug.
Cons: structured output overhead (not all models support it equally well), schema rigidity.
Approach 2: MCP (Model Context Protocol)
Agents interact through a standardized protocol. Each agent is an MCP server that exposes tools. The orchestrator calls the tools of the appropriate agent.
Orchestrator (Claude Code)
│
├── MCP: agent-planner
│     └── tool: create_plan(destination, days, interests)
│
├── MCP: agent-validator
│     └── tool: validate_plan(plan_json)
│
└── MCP: agent-writer
      └── tool: write_description(day_plan)
MCP provides a standard interface for connecting agents. An agent server can be written in any language and run locally or remotely. More on building production MCP servers in the custom MCP servers article.
In practice, MCP is better suited for dev-time agents (IDE assistants), while structured output is better for runtime agents (handling user requests in production).
The Orchestrator: Central Component of the System
The orchestrator decides: which agent to call, in what order, and what to do with errors. A minimal implementation:
import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)

class Orchestrator:
    def __init__(self, agents: dict[str, AgentConfig]):
        self.agents = agents

    async def run_agent(self, name: str, input_data: str) -> Any:
        """Run a single agent with timeout and retry."""
        config = self.agents[name]
        for attempt in range(3):
            try:
                # call_llm: a provider-agnostic async wrapper (not shown here)
                result = await asyncio.wait_for(
                    call_llm(
                        model=config.model,
                        system=config.system_prompt,
                        user=input_data,
                        temperature=config.temperature,
                        max_tokens=config.max_tokens,
                        tools=config.tools,
                    ),
                    timeout=config.timeout_seconds,
                )
                logger.info(f"Agent {name} completed", extra={
                    "agent": name,
                    "attempt": attempt + 1,
                    "model": config.model,
                })
                return result
            except asyncio.TimeoutError:
                logger.warning(f"Agent {name} timeout, attempt {attempt + 1}")
                continue
            except Exception as e:
                logger.error(f"Agent {name} failed: {e}")
                if attempt == 2:
                    raise
        raise RuntimeError(f"Agent {name} failed after 3 attempts")

    async def run_pipeline(self, user_input: str) -> str:
        """Sequential pipeline: classify → plan → validate → format."""
        # Step 1: Classification
        intent = await self.run_agent("classifier", user_input)
        if intent == "general":
            return await self.run_agent("writer", user_input)

        # Step 2: Extract parameters
        params = await self.run_agent("extractor", user_input)

        # Step 3: Generate plan
        plan = await self.run_agent("planner", params)

        # Step 4: Parallel validation
        validation, quality_score = await asyncio.gather(
            self.run_agent("validator", plan),
            self.run_agent("judge", plan),
        )

        # Step 5: If validation failed — retry with feedback
        if not validation.get("is_valid"):
            plan = await self.run_agent(
                "planner",
                f"Previous plan had issues: {validation['issues']}. "
                f"Original request: {params}. Fix the plan."
            )
        return plan
Key design decisions in the orchestrator:
- Retry with context. On retry, the agent receives information about what went wrong. Not a blind retry, but a retry with validator feedback.
- Parallelize where possible. Validator and Judge run concurrently — saving 30–50% of total time.
- Graceful degradation. If Judge is unavailable, the pipeline continues without quality scoring. A response without a score beats no response at all.
Observability: Seeing What’s Happening Inside
A multi-agent system without observability is a black box. Three essential components.
1. Tracing
Every user request is a trace. Every agent call is a span within that trace. Langfuse is a great fit here:
from langfuse import Langfuse

langfuse = Langfuse()

async def run_with_tracing(user_input: str) -> str:
    trace = langfuse.trace(
        name="travel-planning",
        input=user_input,
    )

    # Classifier span
    classifier_span = trace.span(name="classifier")
    intent = await run_agent("classifier", user_input)
    classifier_span.end(output=intent)

    # Planner span
    planner_span = trace.span(name="planner")
    plan = await run_agent("planner", user_input)
    planner_span.end(output=plan)

    # ... other agents

    trace.update(output=plan)
    return plan
In Langfuse you can see: which agents were called, how long each took, how many tokens each consumed, and what each returned. More on setup in the LLM Observability with Langfuse article.
2. Metrics
A minimal set of metrics for a multi-agent system:
| Metric | What it shows | Alert threshold |
|---|---|---|
| agent.latency_p95 | Speed per agent | > 10s for classifier |
| agent.error_rate | Error rate per agent | > 5% |
| agent.token_cost | Cost per agent | > $X/day budget |
| pipeline.success_rate | Successful pipeline runs | < 95% |
| pipeline.retry_rate | How often retries are needed | > 20% |
| judge.quality_score | Average quality score | < 0.7 |
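Turning the table above into an automated check is straightforward. A sketch, with threshold values copied from the table (the direction markers and function names are illustrative):

```python
# Alert thresholds from the metrics table; "gt"/"lt" = fire when
# the value is greater/less than the limit
THRESHOLDS = {
    "agent.error_rate": ("gt", 0.05),
    "pipeline.success_rate": ("lt", 0.95),
    "pipeline.retry_rate": ("gt", 0.20),
    "judge.quality_score": ("lt", 0.7),
}

def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that crossed their alert threshold."""
    fired = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported this window
        if (direction == "gt" and value > limit) or (direction == "lt" and value < limit):
            fired.append(name)
    return fired

alerts = check_alerts({"agent.error_rate": 0.08, "pipeline.success_rate": 0.97})
# → ["agent.error_rate"]
```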
3. Decision Logging
Every orchestrator decision gets logged: why this agent was chosen, why a retry happened, why the pipeline took an alternate path. Without this, debugging is pure guesswork.
logger.info("routing_decision", extra={
    "trace_id": trace.id,
    "intent": intent,
    "selected_agent": "planner",
    "reason": "intent=travel_plan, confidence=0.95",
    "fallback_agent": "writer",
})
Cost: Multi-Agent System vs. One Big Prompt
More agents = more LLM calls = more cost. But a well-designed architecture can actually be cheaper than a single large prompt.
Cost comparison for “plan a trip”:
| Approach | Calls | Models | Tokens (input) | Tokens (output) | Estimated cost |
|---|---|---|---|---|---|
| Single agent | 1 | Claude Sonnet | ~3000 | ~2000 | ~$0.025 |
| Pipeline (5 agents) | 5 | Mix | ~2500 total | ~1800 total | ~$0.008 |
The pipeline is cheaper because:
- Classifier uses Flash (~$0.0001 per call)
- Extractor uses DeepSeek (~$0.14 per 1M input tokens)
- Only the Planner uses an expensive model, and its prompt is shorter (parameters are already extracted)
Result: 60–70% savings on LLM costs with proper task splitting and model routing.
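The arithmetic behind this kind of comparison is worth making explicit. A sketch with illustrative per-token prices — these are assumptions, not current provider pricing, so the absolute numbers will differ from the table above; the point is the structure of the calculation:

```python
# Illustrative price table (USD per 1M tokens, input/output) -- assumptions,
# check your provider's current pricing
PRICES = {
    "gemini-flash": (0.10, 0.40),
    "deepseek-chat": (0.14, 0.28),
    "claude-sonnet": (3.00, 15.00),
}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one LLM call in USD."""
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# One big prompt on an expensive model vs. a routed pipeline
single = call_cost("claude-sonnet", 3000, 2000)
pipeline = (
    call_cost("gemini-flash", 200, 20)        # classifier
    + call_cost("deepseek-chat", 500, 200)    # extractor
    + call_cost("claude-sonnet", 1200, 1200)  # planner (shorter prompt)
    + call_cost("gemini-flash", 600, 400)     # writer
)
```

The savings come almost entirely from shrinking the expensive model's prompt and pushing cheap steps onto cheap models.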
Common Mistakes in Multi-Agent System Design
Over-engineering. Three agents for a task that a single prompt with structured output could handle. If the task is linear and doesn’t require different models — one agent is simpler and more reliable.
No fallback. The planner agent goes down — the entire pipeline is dead. Every critical agent needs a fallback: an alternative model, a simplified prompt, or a cache of previous results.
Chains that are too long. Each agent adds latency and a potential failure point. An 8-agent pipeline with P95 latency of 3 seconds per agent means 24 seconds per response. Users won’t wait.
Ignoring the context window. Passing the full output of one agent into the next agent’s prompt. A 5,000-token planner output plus a 500-token validator system prompt — unnecessary cost. Pass only what the current agent actually needs.
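The fix is to filter before forwarding. A stdlib sketch (field names are hypothetical; with the Pydantic models shown earlier, `model_dump_json(exclude=...)` does the same job):

```python
import json

# Hypothetical planner output: structure plus a large free-text field
planner_output = {
    "destination": "Japan",
    "days": [{"day": 1, "activities": ["Senso-ji"], "estimated_cost_usd": 120.0}],
    "prose": "...thousands of tokens of day descriptions...",
}

# The validator checks structure and numbers, not the prose --
# forward only the fields it actually needs
VALIDATOR_FIELDS = {"destination", "days"}
validator_input = json.dumps(
    {k: v for k, v in planner_output.items() if k in VALIDATOR_FIELDS}
)
```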
No circuit breaker. One agent returns malformed JSON. The next agent gets broken input, fails, the orchestrator retries — three attempts, three failures, wasted tokens. Circuit breaker: if an agent returns invalid output, don’t pass it downstream — surface the error immediately.
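A minimal circuit-breaker sketch: validate output at the boundary and fail fast. In the real pipeline this would be the Pydantic schemas shown earlier; a stdlib version keeps the idea self-contained (field names are illustrative):

```python
import json

REQUIRED_FIELDS = {"destination", "days", "total_cost_usd"}

def guard_output(raw: str) -> dict:
    """Circuit breaker: validate agent output before it goes downstream.

    Malformed or incomplete output is surfaced immediately instead of
    being passed to the next agent and wasting retries there.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"Agent returned malformed JSON: {e}") from e
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise RuntimeError(f"Agent output missing fields: {sorted(missing)}")
    return data

plan = guard_output('{"destination": "Japan", "days": [], "total_cost_usd": 1200.5}')
```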
Where to Start
- Identify the bottleneck. Take your current single-agent pipeline. Find the step that breaks most often or produces poor quality. Extract it into a separate agent.
- Start with two agents. Classifier + Worker. The Classifier identifies the task type on a cheap model. The Worker handles it on the right model. This already gives you routing and cost savings.
- Add structured output. Pydantic schemas for inputs and outputs of each agent. Validation at every step. Without this, debugging a multi-agent system is nearly impossible.
- Connect tracing. Langfuse, LangSmith, or equivalent. See every call, every prompt, every result. Without observability, a multi-agent system is a black box.
- Add a Judge agent. Automatic quality scoring at the end of the pipeline. If the score falls below threshold — retry with feedback. This closes the quality loop.
- Optimize from metrics. Watch latency, cost, and quality score per agent. Swap models, tweak prompts, and reshape architecture based on real data.
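Steps 5 and 6 close the quality loop. A sketch of that loop, assuming the orchestrator's `run_agent` signature and a Judge that returns a hypothetical `{"score": float, "feedback": str}` shape:

```python
import asyncio

async def plan_with_quality_loop(run_agent, request: str,
                                 threshold: float = 0.7,
                                 max_rounds: int = 2) -> str:
    """Generate, score with the Judge, retry with feedback when below threshold.

    run_agent is the orchestrator's runner (hypothetical signature); the Judge
    is assumed to return {"score": float, "feedback": str}.
    """
    plan = await run_agent("planner", request)
    for _ in range(max_rounds):
        review = await run_agent("judge", plan)
        if review["score"] >= threshold:
            break
        # Retry with feedback, not a blind retry
        plan = await run_agent(
            "planner",
            f"Improve this plan. Judge feedback: {review['feedback']}\n"
            f"Original request: {request}",
        )
    return plan
```

`max_rounds` caps the loop so a stubbornly low score cannot burn tokens forever; after the cap, the best available plan ships.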
Frequently Asked Questions
When should I use multi-agent architecture instead of a single LLM call?
Use multi-agent when the task contains conflicting requirements — e.g., analysis needs a large context while generation needs a focused one. Also when different steps benefit from different models (cheap for classification, powerful for generation) or when you need parallel execution.
What is the simplest multi-agent pattern to start with?
Classifier + Worker. The Classifier identifies the task type on a cheap model (like Haiku), then routes to the appropriate Worker with the right model and prompt. This gives you task routing and cost savings with minimal complexity.
How do you debug a multi-agent system?
Structured output (Pydantic schemas) at every agent boundary plus distributed tracing (Langfuse or LangSmith). Without both, a multi-agent system is a black box. Every call, prompt, and result must be observable.