Multi-agent systems look brilliant in demos. A planner delegates to a researcher, a researcher queries a database, a synthesizer writes the report. The demo runs in twelve seconds. Six weeks later, in production, the whole thing hangs every third request and no one knows why.
We have built, debugged, and rescued enough agentic systems to have an opinionated list of what actually breaks — and it is rarely the model. The model is the last thing that fails. The infrastructure around it fails first.
Failure pattern 1: Unbounded loops with no circuit breaker
Agent loops built on LangGraph, CrewAI, or a custom orchestrator tend to retry on failure: an agent calls a tool, the tool returns an error, and the agent tries again. Some frameworks ship a default step cap (LangGraph's recursion_limit, for example), but that is a backstop, not a budget. Without a deliberately chosen iteration ceiling and a global wall-clock timeout, one stuck sub-agent can block the entire workflow indefinitely.
The fix is two-layered. Every sub-agent needs a max-iterations guard. Every top-level workflow needs a wall-clock budget. We typically set sub-agent max turns to 5–8 and root-level timeout to 30–45 seconds for interactive flows, 10–15 minutes for background batch flows.
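
    # Defensive orchestrator pattern
    import asyncio
    import time

    MAX_TURNS = 6
    WALL_CLOCK = 30  # seconds, shared by every turn in the run

    class AgentTimeoutError(Exception): ...
    class AgentMaxTurnsError(Exception): ...

    # `agent`, `Context`, and `Result` are placeholders for your framework's objects.
    async def run_agent(task: str, ctx: "Context") -> "Result":
        deadline = time.monotonic() + WALL_CLOCK
        for _ in range(MAX_TURNS):
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise AgentTimeoutError(f"Exceeded {WALL_CLOCK}s budget")
            try:  # bound the step itself: a check between turns cannot interrupt a hung call
                result = await asyncio.wait_for(agent.step(task, ctx), timeout=remaining)
            except asyncio.TimeoutError:
                raise AgentTimeoutError(f"Exceeded {WALL_CLOCK}s budget") from None
            if result.is_terminal:
                return result
        raise AgentMaxTurnsError(f"Exceeded {MAX_TURNS} turns")

Failure pattern 2: State mutation races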
Parallel sub-agents sharing a mutable context dict (a common pattern because it looks clean) produce non-deterministic results under load. Sub-agent A reads the tool-results list, sub-agent B appends to it, and sub-agent A then writes back its stale copy. B's result is silently dropped.
The correct model is immutable message passing. Each agent step receives the current state snapshot and returns a delta. The orchestrator applies deltas in order. This is exactly what LangGraph's StateGraph does when you use annotated reducer fields — but teams frequently bypass it by passing mutable dicts.
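The snapshot-and-delta model can be sketched in plain Python without any framework. This is a minimal illustration, not LangGraph's implementation: `State`, `Delta`, `apply_delta`, and `sub_agent` are hypothetical names, and the "agents" are stand-ins that only sleep and report what they saw.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class State:
    """Immutable snapshot handed to every agent step."""
    tool_results: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Delta:
    """What one agent wants to add; it never touches State directly."""
    new_results: Tuple[str, ...] = ()


def apply_delta(state: State, delta: Delta) -> State:
    # Reducer: builds a new snapshot instead of mutating the old one.
    return replace(state, tool_results=state.tool_results + delta.new_results)


async def sub_agent(name: str, delay: float, snapshot: State) -> Delta:
    # Hypothetical worker: reads the snapshot, returns only its own additions.
    await asyncio.sleep(delay)
    seen = len(snapshot.tool_results)
    return Delta(new_results=(f"{name}: saw {seen} prior results",))


async def run_parallel(state: State) -> State:
    # Both agents receive the same snapshot and run concurrently.
    deltas = await asyncio.gather(
        sub_agent("A", 0.02, state),
        sub_agent("B", 0.01, state),
    )
    # gather returns results in argument order, so the merge order is
    # deterministic even though B finishes first.
    for delta in deltas:
        state = apply_delta(state, delta)
    return state


final_state = asyncio.run(run_parallel(State(tool_results=("seed",))))
```

Because neither agent can write to the shared snapshot, there is no stale copy to write back: both results survive, and their order depends on the orchestrator rather than on timing.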
Failure pattern 3: Prompt coupling between agents
Teams tune an orchestrator prompt and a worker prompt together, iteratively, until they work. Then they swap the model. The inter-agent communication was implicitly calibrated to one model's verbosity, its tool-call formatting, and whether it wraps JSON in code fences. Switching models, even to a better one, breaks the hand-off.
Design inter-agent interfaces as typed contracts, not prose. Define a Pydantic schema for every message between agents. Validate at the boundary. The model is a detail; the schema is the contract.
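A boundary validator might look like the sketch below. To stay dependency-free it swaps Pydantic for a standard-library dataclass with hand-written checks; with Pydantic, the same contract would be a `BaseModel` and the manual checks would disappear. `ResearchResult`, its fields, and `ContractError` are hypothetical names chosen for illustration.

```python
import json
import re
from dataclasses import dataclass
from typing import List


class ContractError(ValueError):
    """Raised when a message does not satisfy the inter-agent contract."""


@dataclass(frozen=True)
class ResearchResult:
    # Hypothetical message a researcher agent sends to a synthesizer.
    query: str
    findings: List[str]
    confidence: float


def strip_json_fence(raw: str) -> str:
    # Some models wrap JSON in a Markdown code fence; normalise that away
    # so the contract does not depend on one model's formatting habit.
    return re.sub(r"^\s*`{3}(?:json)?\s*|\s*`{3}\s*$", "", raw.strip())


def parse_research_result(raw: str) -> ResearchResult:
    try:
        data = json.loads(strip_json_fence(raw))
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("expected a JSON object")

    query = data.get("query")
    findings = data.get("findings")
    confidence = data.get("confidence")

    if not isinstance(query, str):
        raise ContractError("query must be a string")
    if not isinstance(findings, list) or not all(isinstance(f, str) for f in findings):
        raise ContractError("findings must be a list of strings")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ContractError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ContractError("confidence must be in [0, 1]")
    return ResearchResult(query, list(findings), float(confidence))
```

The receiving agent only ever sees a `ResearchResult` or a `ContractError`, so a model swap that changes verbosity or fencing fails loudly at the boundary instead of corrupting the hand-off.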
Failure pattern 4: No observability past the top-level trace
A default Langfuse or LangSmith setup captures the root span. It does not necessarily capture what happened inside a sub-agent running in a background task, or why a tool call returned an empty result set. Teams end up debugging production incidents by staring at the final output and guessing. The minimum instrumentation:
- Propagate trace context (trace_id, parent_span_id) through every async boundary
- Log every tool call — arguments, latency, result status — as a child span
- Emit structured events for agent decisions: why did the planner choose sub-agent B over sub-agent C?
- Track token consumption per agent per run — cost spikes pinpoint which agent is prompt-bloating
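One way to carry trace context across async boundaries is Python's `contextvars`, sketched below with the standard library only. The `span` helper, the `EVENTS` list (a stand-in for a real exporter), and the agent and tool functions are all hypothetical; a production setup would more likely use the tracing SDK of whichever backend is in place.

```python
import asyncio
import contextvars
import time
import uuid
from contextlib import contextmanager

# Current span for whichever task is running; None means "no active trace".
_current_span = contextvars.ContextVar("current_span", default=None)
EVENTS = []  # stand-in for a real exporter


@contextmanager
def span(name, **attrs):
    parent = _current_span.get()
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:8],
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_span_id": parent["span_id"] if parent else None,
        "status": "ok",
        **attrs,
    }
    token = _current_span.set(record)
    start = time.monotonic()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["latency_s"] = time.monotonic() - start
        _current_span.reset(token)
        EVENTS.append(record)


async def search_tool(query):
    # Hypothetical tool: arguments and result status are recorded on the span.
    with span("tool:search", args={"query": query}) as s:
        await asyncio.sleep(0.01)
        s["result_count"] = 0  # an empty result set is visible in the trace
        return []


async def researcher(query):
    with span("agent:researcher"):
        return await search_tool(query)


async def planner():
    with span("agent:planner") as root:
        # create_task copies the current context, so the child task inherits
        # the planner span as its parent without explicit plumbing.
        await asyncio.create_task(researcher("agent failures"))
        return root["trace_id"]


trace_id = asyncio.run(planner())
```

Within one process, asyncio tasks inherit the context automatically. Work handed to a separate process or a message queue does not, so in those cases `trace_id` and `parent_span_id` have to travel in the payload.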
Failure pattern 5: Tool schemas that model edge cases incorrectly
Tool descriptions function like few-shot examples for the model. If your search tool's description says "returns a list of documents" but the implementation sometimes returns an empty list, sometimes a list with null entries, and sometimes throws on a downstream API failure, the model is liable to hallucinate a successful result, because nothing in the tool schema prepared it for those states.
Write tool schemas defensively. Explicitly document the error states. Return typed discriminated unions, not bare dicts. The model reasons better when it has an explicit error variant to pattern-match against.
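As a rough sketch of what a discriminated union could look like for the search tool described above, the example below uses frozen dataclasses tagged with a literal `status` field. The variant names, the `backend` callable, and `render_for_model` are hypothetical, and treating every exception as retryable is a simplification.

```python
from dataclasses import dataclass
from typing import List, Literal, Union


@dataclass(frozen=True)
class SearchOk:
    documents: List[str]
    status: Literal["ok"] = "ok"


@dataclass(frozen=True)
class SearchEmpty:
    status: Literal["empty"] = "empty"


@dataclass(frozen=True)
class SearchError:
    reason: str
    retryable: bool
    status: Literal["error"] = "error"


SearchResult = Union[SearchOk, SearchEmpty, SearchError]


def search(query: str, backend) -> SearchResult:
    """Hypothetical tool wrapper: every outcome maps to exactly one variant."""
    try:
        raw = backend(query)
    except Exception as exc:
        # A downstream failure becomes data the model can see, not a crash.
        # Simplification: a real wrapper would classify which errors are retryable.
        return SearchError(reason=str(exc), retryable=True)
    docs = [d for d in raw if d is not None]  # drop null entries at the boundary
    return SearchOk(documents=docs) if docs else SearchEmpty()


def render_for_model(result: SearchResult) -> str:
    # The status tag is what the model and the orchestrator branch on.
    if isinstance(result, SearchOk):
        return f"status=ok; {len(result.documents)} documents"
    if isinstance(result, SearchEmpty):
        return "status=empty; no documents matched, do not invent any"
    return f"status=error; {result.reason}; retryable={result.retryable}"
```

The three states the prose lists (empty list, null entries, thrown exception) each land in a named variant, so the tool description can enumerate them and the model never has to infer what a bare empty dict means.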
The structural fix: treat agents like distributed services
Every pattern above has the same root cause: teams build multi-agent systems with the mental model of a single LLM call, not a distributed system. But a multi-agent workflow is a distributed system. Apply the same engineering discipline:
- Explicit timeouts at every boundary, not just the outermost
- Idempotent operations, so that retries produce the same result
- Structured logging with distributed trace propagation
- Schema-driven interfaces between components
- Failure modes documented and tested, not discovered in production
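Of the items above, idempotent retries are the one least often seen in agent code, so a small sketch may help. It assumes an in-memory store and a hypothetical `call_once` wrapper keyed on run, operation, and arguments; `send_report` is a made-up flaky side effect used only to exercise the retry path.

```python
import asyncio
import hashlib
import json

_completed = {}  # idempotency key -> stored result; a real system would persist this


def idempotency_key(run_id: str, operation: str, args: dict) -> str:
    # Same run, same operation, same arguments -> same key across retries.
    payload = json.dumps({"run": run_id, "op": operation, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


async def call_once(run_id, operation, args, side_effect, attempts=3, timeout=1.0):
    key = idempotency_key(run_id, operation, args)
    if key in _completed:
        return _completed[key]  # a retry after success replays the stored result
    last_error = None
    for _ in range(attempts):
        try:
            # Each attempt gets its own timeout, independent of any outer budget.
            result = await asyncio.wait_for(side_effect(**args), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            continue
        _completed[key] = result
        return result
    raise RuntimeError(f"{operation} failed after {attempts} attempts") from last_error


calls = {"n": 0}


async def send_report(recipient):
    # Hypothetical flaky side effect: fails once, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    return f"sent to {recipient}"


async def demo():
    first = await call_once("run-1", "send_report", {"recipient": "ops"}, send_report)
    second = await call_once("run-1", "send_report", {"recipient": "ops"}, send_report)
    return first, second


first, second = asyncio.run(demo())
```

This only deduplicates at the orchestrator. If a side effect completes but its response is lost to a timeout, the retry will run it again, so the downstream service also needs to honor the key for the operation to be idempotent end to end.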