Why Most Agent Demos Break in Production
The gap between a convincing agent demo and a reliable production agent is enormous. In a demo, the prompt is hand-tuned for the happy path, tools always return clean data, and the conversation wraps up in three turns. In production, tools fail intermittently, users send malformed requests, the context window fills up after twenty turns, and the model occasionally calls a tool that does not exist in your registry.
The root cause is almost always architectural. Demos treat agents as a single LLM call with a list of functions bolted on. Real agents are state machines with multiple decision branches, fallback paths, bounded memory, and distinct error surfaces at every layer — the API call, the tool execution, and the model output itself. Building them reliably requires engineering each of these layers deliberately.
| Failure Mode | Root Cause | Fix |
|---|---|---|
| Tool call loops (model calls same tool repeatedly) | No iteration cap, ambiguous tool description | Max iterations + explicit idempotency docs |
| Hallucinated tool names or parameters | Vague descriptions, too many tools per call | Precise descriptions, <20 tools per invocation |
| Context overflow after 10+ conversation turns | No message trimming strategy | Sliding window + fact extraction to external store |
| Silent wrong answers on tool errors | Exceptions swallowed, returned as empty success | Structured error dicts with error field |
| Rate limit cascades across concurrent sessions | No retry logic, shared API client state | Exponential backoff + per-session client |
Tool Design: Schemas the Model Actually Parses
The quality of your tool descriptions determines agent reliability more than the model you choose. The model reads the description field at inference time; it is not a comment or a docs sidebar but a prompt fragment that shapes every decision to call or skip a tool. Vague descriptions produce ambiguous calls; precise descriptions produce correct calls.
Three rules for descriptions that work: (1) state when to call this tool vs similar tools, (2) document preconditions the model must check first, and (3) describe what the return value looks like so the model knows whether to use the result or handle an error branch.
```python
# Bad: vague description, no guidance on when vs. other tools
bad_tool = {
    "name": "search",
    "description": "Search for information.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Good: when to call it, what it excludes, what it returns
good_tool = {
    "name": "search_product_catalog",
    "description": (
        "Search the live product catalog by keyword. Use this when the user asks about "
        "product availability, pricing, or specifications. Do NOT use for order status — "
        "use get_order_status instead. Returns a list of products with id, name, price_usd, "
        "and in_stock fields. Returns an empty list if no matches are found."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search terms. Use natural language, not product IDs.",
            },
            "limit": {
                "type": "integer",
                "description": "Max results to return (1-20). Default: 5.",
                "minimum": 1,
                "maximum": 20,
            },
            "in_stock_only": {
                "type": "boolean",
                "description": "Set true to filter out out-of-stock items.",
            },
        },
        "required": ["query"],
    },
}
```

Dangerous tools — anything that writes, charges, or deletes — need a confirmation gate encoded in the description. The model follows this instruction reliably when it is explicit:
```python
create_order_tool = {
    "name": "create_order",
    "description": (
        "Create a purchase order. IMPORTANT: only call this after "
        "(1) confirming the product ID exists via search_product_catalog, "
        "(2) verifying the item is in stock, and "
        "(3) receiving explicit confirmation from the customer. "
        "This action charges the customer and cannot be undone through this interface. "
        "Returns an order_id string on success, or a dict with an 'error' key on failure."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "product_id": {
                "type": "string",
                "description": "Must be an id from search_product_catalog results.",
            },
            "quantity": {"type": "integer", "minimum": 1},
        },
        "required": ["customer_id", "product_id", "quantity"],
    },
}
```
The Agent Loop: ReAct in Practice
The ReAct pattern (Reason + Act) formalizes what a reliable loop looks like: the model reasons about what to do, calls a tool, observes the result, and repeats until it has enough information to give a final answer. This maps cleanly to the Anthropic tool use API: stop_reason: "tool_use" means the model wants a tool result, you execute and return it as tool_result blocks, and the loop continues until stop_reason: "end_turn".
The single most important engineering decision is the iteration cap. Without one, a model stuck in a reasoning loop or receiving unexpected tool output can run indefinitely, burning tokens and API budget. Ten iterations handles nearly every real-world task — if an agent needs more, the task decomposition is the problem.
```python
import anthropic
import json
from typing import Any, Callable

def run_agent(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable[..., Any]],
    max_iterations: int = 10,
    model: str = "claude-opus-4-5",
) -> str:
    """Run a ReAct agent loop until end_turn or max_iterations."""
    client = anthropic.Anthropic()
    messages: list[dict] = [{"role": "user", "content": user_message}]
    for iteration in range(max_iterations):
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        # Append assistant turn so the model sees its own reasoning
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
        if response.stop_reason != "tool_use":
            # max_tokens hit or unexpected stop — return whatever text exists
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
        # Execute every tool call in this turn and collect results
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            handler = tool_handlers.get(block.name)
            if handler is None:
                result: Any = {
                    "error": f"Tool '{block.name}' is not registered.",
                    "available_tools": list(tool_handlers.keys()),
                }
            else:
                try:
                    result = handler(**block.input)
                except Exception as exc:
                    result = {"error": type(exc).__name__, "detail": str(exc)}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result) if not isinstance(result, str) else result,
            })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError(
        f"Agent did not complete within {max_iterations} iterations. "
        f"Last stop_reason: {response.stop_reason}"
    )

# Wire it together
def search_products(query: str, limit: int = 5, in_stock_only: bool = False) -> list[dict]:
    # Your actual database query here
    return [{"id": "p-42", "name": "Widget Pro", "price_usd": 29.99, "in_stock": True}]

answer = run_agent(
    system_prompt="You are a helpful product assistant. Use tools to answer questions accurately.",
    user_message="Do you have any widgets in stock under $50?",
    tools=[good_tool],
    tool_handlers={"search_product_catalog": search_products},
)
```

Note: the loop returns a `{"error": "...", "detail": "..."}` dict when tool handlers raise exceptions — never let raw Python tracebacks surface as tool result content. The model reads the result and can reason about errors: try an alternative approach, ask the user for clarification, or explain the limitation. A bare traceback confuses the model and wastes tokens on irrelevant stack frames.

Memory Architecture: Four Layers
Memory is the most underspecified component in most agent implementations. Developers wire up a messages list and call it done, then hit context limits at turn fifteen. Production agents need four distinct memory layers, each with different capacity, latency, and access semantics:
1. Working memory (in-context)
The current messages array passed to the model. Highest fidelity, zero latency, limited to the context window (100k–200k tokens depending on model). Exhausted fastest in tool-heavy agents because tool results inflate the conversation. Requires active management: trim old turns, summarize threads, or offload facts to external memory before the window fills.
2. External key-value memory
A fast store (Redis, DynamoDB) for structured agent state: user preferences, task checkpoints, session flags. Retrieved deterministically by key. Use this for anything the agent needs to remember across sessions that can be expressed as structured data. Inject relevant entries into the system prompt at session start, or expose them via a get_memory(key) tool call.
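A minimal sketch of what `get_memory` / `set_memory` tool handlers could look like; the in-process dict here stands in for a real store such as Redis or DynamoDB, and the handler names are illustrative:

```python
# Hypothetical key-value memory handlers; swap the dict for a Redis client in production.
_kv_store: dict[str, str] = {}

def get_memory(key: str) -> dict:
    """Deterministic lookup the model can invoke as a tool."""
    if key in _kv_store:
        return {"found": True, "value": _kv_store[key]}
    return {"found": False, "value": None}

def set_memory(key: str, value: str) -> dict:
    """Persist a structured fact under a stable key."""
    _kv_store[key] = value
    return {"ok": True}
```

Returning a `found` flag instead of raising on a missing key keeps the tool result structured, so the model can branch on absence rather than parse an error.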
3. Semantic / vector memory
Embeddings of past conversations, documents, or knowledge retrieved by similarity. Use Qdrant or pgvector for retrieval. Enables long-term recall without burning context tokens on irrelevant history. The agent retrieves the top-k most relevant chunks and injects them as context before the first LLM call.
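The retrieval step itself is simple once embeddings exist. This sketch uses plain cosine similarity over toy vectors; a real system would call an embedding model and a vector store like Qdrant or pgvector instead:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], corpus: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The returned chunks are what you inject as context before the first LLM call.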
4. Procedural / instruction memory
Stable behavioral rules, personas, and domain knowledge in the system prompt. These rarely change but have the highest impact on agent behavior. Store them in a versioned prompt registry rather than hardcoded strings — treat a system prompt change as a deployment: version it, run your eval suite, and roll back if quality degrades.
The most impactful detail is active working memory management. Here is a minimal class that handles trimming and persistent fact injection:
```python
from dataclasses import dataclass, field
from typing import Any, Optional
import json
from pathlib import Path

@dataclass
class AgentMemory:
    """Working memory manager with overflow protection and cross-session persistence."""

    messages: list[dict] = field(default_factory=list)
    # Short facts extracted from prior sessions, injected into the system prompt
    facts: dict[str, str] = field(default_factory=dict)
    # Structured state accessible via tool calls
    state: dict[str, Any] = field(default_factory=dict)
    _persist_path: Optional[Path] = None

    def inject_context(self, system_prompt: str) -> str:
        """Prepend remembered facts so the model has cross-session context."""
        if not self.facts:
            return system_prompt
        lines = [f"- {k}: {v}" for k, v in self.facts.items()]
        facts_block = "\n".join(lines)
        return f"{system_prompt}\n\nKnown context from prior sessions:\n{facts_block}"

    def trim_to_fit(self, max_pairs: int = 20) -> None:
        """Drop the oldest turns when history grows too long.

        Preserves the first message (often the original task) and removes
        whole user/assistant pairs to maintain conversation coherence.
        """
        if len(self.messages) <= max_pairs * 2:
            return
        preserved = self.messages[:1]
        # Even-length tail so it starts on a user turn, keeping pairs intact
        tail = self.messages[-(max_pairs * 2 - 2):]
        self.messages = preserved + tail

    def remember(self, key: str, value: str) -> None:
        """Persist a fact for injection in future sessions."""
        self.facts[key] = value
        self._save()

    def _save(self) -> None:
        if self._persist_path:
            data = {"facts": self.facts, "state": self.state}
            self._persist_path.write_text(json.dumps(data, indent=2))

    @classmethod
    def load(cls, path: Path) -> "AgentMemory":
        mem = cls(_persist_path=path)
        if path.exists():
            data = json.loads(path.read_text())
            mem.facts = data.get("facts", {})
            mem.state = data.get("state", {})
        return mem

# Usage: load at session start, pass enriched prompt to run_agent
memory = AgentMemory.load(Path("agent_state.json"))
enriched_system = memory.inject_context(base_system_prompt)
# ... run agent loop ...
memory.trim_to_fit(max_pairs=20)  # call before each LLM invocation
memory.remember("user_currency", "EUR")  # persist facts extracted during the session
```

Error Recovery: Retry, Validate, and Degrade Gracefully
Agents fail at three distinct layers: the LLM API (rate limits, timeouts, 5xx overload), tool execution (network failures, validation errors, business logic exceptions), and model output (hallucinated tool name, missing required parameter). Each layer needs a different strategy.
API-level failures are transient and retry-safe. The Anthropic API returns 429 for rate limits and 529 for overload. A jittered exponential backoff prevents thundering herd when multiple agent instances hit limits simultaneously:
```python
import time
import random
from anthropic import RateLimitError, APIStatusError, APIConnectionError
from typing import Any, Callable, TypeVar

T = TypeVar("T")

def call_with_retry(
    fn: Callable[..., T],
    *args: Any,
    max_retries: int = 4,
    base_delay: float = 1.0,
    **kwargs: Any,
) -> T:
    """Exponential backoff with jitter for transient API errors."""
    last_exc: Exception | None = None
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except RateLimitError as exc:
            last_exc = exc
            # Honor Retry-After header if the API provides it
            retry_after = float(getattr(exc, "retry_after", None) or 0)
            delay = max(retry_after, base_delay * (2 ** attempt)) + random.uniform(0, 0.5)
        except APIStatusError as exc:
            if exc.status_code not in {500, 529}:
                raise  # 4xx errors (except 429) are not retryable
            last_exc = exc
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.3)
        except APIConnectionError as exc:
            last_exc = exc
            delay = base_delay * (2 ** attempt)
        if attempt < max_retries:
            time.sleep(delay)
    raise RuntimeError(f"Failed after {max_retries} retries") from last_exc
```

Tool-level failures should surface to the model as structured error results. The model can then reason about the failure — request clarification, try an alternative tool, or explain the limitation. Never raise exceptions from tool handlers:
```python
from typing import Any, Callable

def safe_tool_call(handler: Callable, **kwargs: Any) -> dict:
    """Wrap any tool call to guarantee a structured result dict."""
    try:
        result = handler(**kwargs)
        return {"ok": True, "result": result}
    except ValueError as exc:
        return {"ok": False, "error": "invalid_input", "detail": str(exc)}
    except TimeoutError:
        return {
            "ok": False,
            "error": "timeout",
            "detail": "The service did not respond in time. Please try again.",
        }
    except PermissionError as exc:
        return {"ok": False, "error": "permission_denied", "detail": str(exc)}
    except Exception as exc:
        return {"ok": False, "error": "unexpected", "detail": f"{type(exc).__name__}: {exc}"}

# Graceful degradation: fall back to a simpler path if the primary tool fails
def search_with_fallback(query: str) -> dict:
    result = safe_tool_call(search_product_catalog, query=query)
    if result["ok"]:
        return result["result"]
    # Attempt a simpler keyword-only match before giving up
    fallback = safe_tool_call(search_product_catalog_basic, query=query)
    if fallback["ok"]:
        return {
            "results": fallback["result"],
            "note": "Using basic search — full-text search is temporarily unavailable.",
        }
    return {
        "error": "Search is temporarily unavailable.",
        "detail": result["detail"],
    }
```

Note: validate model-supplied parameters against a schema (for example with Pydantic) before executing the handler — a malformed call raises a ValidationError with a precise message you can return as a structured error. The model self-corrects on the next turn rather than propagating invalid data into your database or downstream API.

Observability: What to Instrument in Agent Systems
Standard APM metrics — request latency, error rate, throughput — are necessary but not sufficient for agents. Agents introduce new failure dimensions: iteration count, tool call distribution, and model reasoning quality. Without agent-specific telemetry, debugging a production failure means reading raw conversation logs by hand.
Instrument at least four levels:
Agent-level span
One trace per agent invocation. Record: agent name, total iterations, total input/output tokens, total wall-clock duration, and final outcome (success / max_iterations_exceeded / error). This is the rollup metric your SLOs are built on.
LLM call span
One child span per messages.create() call. Record: model name, input/output tokens, stop_reason, latency, number of tools available. This surfaces which iterations are expensive and which stop reasons correlate with downstream failures.
Tool call span
One child span per tool execution. Record: tool name, sanitized input parameters, output size in bytes, latency, and success/failure. This directly shows which tools are called most, which are slowest, and which fail most often — the primary input to reliability engineering prioritization.
Conversation snapshot
Store full conversation logs (PII masked) in a queryable store, each entry linked to the trace ID. When a user reports a wrong answer, you need the full message history, not just the metrics. Log one JSON object per message and index by agent name and date.
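The snapshot store can be as simple as append-only JSON lines keyed by trace ID. This sketch assumes a `mask` callable for PII redaction; all names here are illustrative:

```python
import json
from pathlib import Path

def log_snapshot(path: Path, trace_id: str, agent_name: str, messages: list[dict],
                 mask=lambda text: text) -> None:
    """Append one masked conversation snapshot as a JSON line, linked to its trace."""
    entry = {
        "trace_id": trace_id,
        "agent": agent_name,
        "messages": [
            {"role": m["role"], "content": mask(str(m["content"]))} for m in messages
        ],
    }
    with path.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
```

One line per snapshot keeps the file greppable by trace ID and trivially loadable into a queryable store later.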
```python
import time
from typing import Any, Callable
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def run_agent_instrumented(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict,
    max_iterations: int = 10,
    agent_name: str = "default",
) -> str:
    with tracer.start_as_current_span(f"agent.{agent_name}") as span:
        span.set_attribute("agent.name", agent_name)
        span.set_attribute("agent.max_iterations", max_iterations)
        span.set_attribute("agent.tools_count", len(tools))
        try:
            result = run_agent(
                system_prompt=system_prompt,
                user_message=user_message,
                tools=tools,
                tool_handlers=tool_handlers,
                max_iterations=max_iterations,
            )
            span.set_attribute("agent.outcome", "success")
            return result
        except RuntimeError as exc:
            span.set_attribute("agent.outcome", "max_iterations_exceeded")
            span.record_exception(exc)
            raise
        except Exception as exc:
            span.set_attribute("agent.outcome", "error")
            span.record_exception(exc)
            raise

def instrument_tool_handler(name: str, handler: Callable) -> Callable:
    """Wrap a tool handler to emit a child span with latency and outcome."""
    def wrapped(**kwargs: Any) -> Any:
        with tracer.start_as_current_span(f"tool.{name}") as span:
            span.set_attribute("tool.name", name)
            start = time.perf_counter()
            try:
                result = handler(**kwargs)
                span.set_attribute("tool.outcome", "success")
                return result
            except Exception as exc:
                span.set_attribute("tool.outcome", "error")
                span.record_exception(exc)
                raise
            finally:
                span.set_attribute("tool.duration_ms", (time.perf_counter() - start) * 1000)
    return wrapped
```

Production Checklist
Cap iterations and raise a distinct error type
Every agent loop needs a hard iteration cap (default: 10). When the cap is hit, raise a named exception (AgentMaxIterationsError) rather than returning a partial result. Log the last three messages at WARNING level for debugging. Alert if more than 5% of agent runs reach the cap — it indicates a tool description or task decomposition problem, not a transient failure.
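One way to encode this as a named exception, carrying the three-message tail described above for WARNING-level logging; the class name `AgentMaxIterationsError` follows the convention suggested here, not a library API:

```python
class AgentMaxIterationsError(RuntimeError):
    """Raised when an agent loop hits its hard iteration cap."""

    def __init__(self, max_iterations: int, messages: list[dict]):
        super().__init__(f"Agent exceeded {max_iterations} iterations")
        self.max_iterations = max_iterations
        # Keep only the conversation tail for logging, not the full history
        self.last_messages = messages[-3:]
```

Raising a distinct type (instead of a bare `RuntimeError`) lets callers and alerting rules distinguish cap hits from genuine errors.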
Version your system prompts like code
Store system prompts in a version-controlled YAML file or prompt registry. Never edit prompts directly in production. Run your eval suite before any prompt change reaches live traffic, and log the prompt hash in every trace. When behavior regresses, you need to know whether the change came from a model update or a prompt edit — without version tracking you cannot tell.
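A short fingerprint is enough to correlate traces with prompt versions; a minimal sketch:

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of a system prompt to attach to every trace."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
```

Log this value as a span attribute alongside the model name, and a behavior regression can be attributed to a prompt edit or a model update in one query.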
Sanitize PII from tool results before logging
Tool results appear in conversation logs and traces. If a tool returns PII — email addresses, SSNs, payment data — apply a masking function before logging. The model receives the full result for its reasoning; your telemetry system stores only the masked version. Use Microsoft Presidio for automated PII detection and redaction.
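As a cheap first pass before wiring in Presidio, a regex mask for logs might look like this (the patterns are illustrative, not exhaustive):

```python
import re

# Illustrative patterns only; real PII detection needs a dedicated tool
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace obvious email addresses and SSNs before text reaches logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)
```

Apply this only on the telemetry path; the model still receives the unmasked tool result it needs for reasoning.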
Run an eval suite before every model or prompt change
Build a golden dataset of 50–100 representative agent interactions with known correct outputs or labeled behavior. Run this suite automatically when you change models, system prompts, or tool descriptions. Track task completion rate, tool selection accuracy, and answer quality. A model upgrade that improves general reasoning can still degrade specific tool-calling patterns — evals catch this before it reaches production.
Gate irreversible actions behind a human-in-the-loop step
For agents that can send emails, charge customers, or delete data, add a confirmation gate: the agent produces a proposed_action output, your application layer presents it for approval (via UI or Slack webhook), and execution resumes only after explicit sign-off. The Anthropic tool use documentation covers patterns for pausing and resuming agent execution across HTTP requests.
Set per-session token budgets and alert on spend anomalies
A single misconfigured agent in an infinite retry loop can consume hundreds of dollars in minutes. Count input + output tokens across all LLM calls in a session and abort with a budget-exceeded error when the limit is hit. Alert on P99 token spend per agent type so you catch cost anomalies before they compound. Typical budget: 50k tokens per session for a task-focused agent, 200k for a research agent with many tool calls.
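A per-session budget can be a small counter charged after every LLM call; a sketch under the 50k-token default mentioned above:

```python
class TokenBudget:
    """Abort a session when cumulative token spend crosses the cap."""

    def __init__(self, limit: int = 50_000):
        self.limit = limit
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one LLM call's usage; raise once the budget is exhausted."""
        self.spent += input_tokens + output_tokens
        if self.spent > self.limit:
            raise RuntimeError(
                f"Token budget exceeded: {self.spent} > {self.limit}"
            )
```

Call `charge` with the usage numbers from each API response inside the agent loop, so a runaway session aborts on the very call that crosses the cap.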
Building AI agents for your product or internal tooling?
We design and implement production-grade AI agent systems — from tool orchestration and memory architecture to error recovery, observability, and evaluation pipelines. Let’s talk.