Why Most Agent Demos Break in Production
The gap between a convincing agent demo and a reliable production agent is enormous. In a demo, the prompt is hand-tuned for the happy path, tools always return clean data, and the conversation wraps up in three turns. In production, tools fail intermittently, users send malformed requests, the context window fills up after twenty turns, and the model occasionally calls a tool that does not exist in your registry.
The root cause is almost always architectural. Demos treat agents as a single LLM call with a list of functions bolted on. Real agents are state machines with multiple decision branches, fallback paths, bounded memory, and distinct error surfaces at every layer — the API call, the tool execution, and the model output itself. Building them reliably requires engineering each of these layers deliberately.
| Failure Mode | Root Cause | Fix |
|---|---|---|
| Tool call loops (model calls same tool repeatedly) | No iteration cap, ambiguous tool description | Max iterations + explicit idempotency docs |
| Hallucinated tool names or parameters | Vague descriptions, too many tools per call | Precise descriptions, <20 tools per invocation |
| Context overflow after 10+ conversation turns | No message trimming strategy | Sliding window + fact extraction to external store |
| Silent wrong answers on tool errors | Exceptions swallowed, returned as empty success | Structured error dicts with error field |
| Rate limit cascades across concurrent sessions | No retry logic, shared API client state | Exponential backoff + per-session client |
Tool Design: Schemas the Model Actually Parses
The quality of your tool descriptions determines agent reliability more than the model you choose. The model reads the description field at inference time; it is not a comment or a docs sidebar but a prompt fragment that shapes every decision to call or skip a tool. Vague descriptions produce ambiguous calls; precise descriptions produce correct calls.
Three rules for descriptions that work: (1) state when to call this tool vs similar tools, (2) document preconditions the model must check first, and (3) describe what the return value looks like so the model knows whether to use the result or handle an error branch.
```python
# Bad: vague description, no guidance on when vs. other tools
bad_tool = {
    "name": "search",
    "description": "Search for information.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Good: when to call it, what it excludes, what it returns
good_tool = {
    "name": "search_product_catalog",
    "description": (
        "Search the live product catalog by keyword. Use this when the user asks about "
        "product availability, pricing, or specifications. Do NOT use for order status — "
        "use get_order_status instead. Returns a list of products with id, name, price_usd, "
        "and in_stock fields. Returns an empty list if no matches are found."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search terms. Use natural language, not product IDs.",
            },
            "limit": {
                "type": "integer",
                "description": "Max results to return (1-20). Default: 5.",
                "minimum": 1,
                "maximum": 20,
            },
            "in_stock_only": {
                "type": "boolean",
                "description": "Set true to filter out out-of-stock items.",
            },
        },
        "required": ["query"],
    },
}
```

Dangerous tools — anything that writes, charges, or deletes — need a confirmation gate encoded in the description. The model follows this instruction reliably when it is explicit:
```python
create_order_tool = {
    "name": "create_order",
    "description": (
        "Create a purchase order. IMPORTANT: only call this after "
        "(1) confirming the product ID exists via search_product_catalog, "
        "(2) verifying the item is in stock, and "
        "(3) receiving explicit confirmation from the customer. "
        "This action charges the customer and cannot be undone through this interface. "
        "Returns an order_id string on success, or a dict with an 'error' key on failure."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "product_id": {
                "type": "string",
                "description": "Must be an id from search_product_catalog results.",
            },
            "quantity": {"type": "integer", "minimum": 1},
        },
        "required": ["customer_id", "product_id", "quantity"],
    },
}
```
The Agent Loop: ReAct in Practice
The ReAct pattern (Reason + Act) formalizes what a reliable loop looks like: the model reasons about what to do, calls a tool, observes the result, and repeats until it has enough information to give a final answer. This maps cleanly to the Anthropic tool use API: stop_reason: "tool_use" means the model wants a tool result, you execute and return it as tool_result blocks, and the loop continues until stop_reason: "end_turn".
The single most important engineering decision is the iteration cap. Without one, a model stuck in a reasoning loop or receiving unexpected tool output can run indefinitely, burning tokens and API budget. Ten iterations handles nearly every real-world task — if an agent needs more, the task decomposition is the problem.
```python
import anthropic
import json
from typing import Any, Callable

def run_agent(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable[..., Any]],
    max_iterations: int = 10,
    model: str = "claude-opus-4-5",
) -> str:
    """Run a ReAct agent loop until end_turn or max_iterations."""
    client = anthropic.Anthropic()
    messages: list[dict] = [{"role": "user", "content": user_message}]
    for iteration in range(max_iterations):
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        # Append assistant turn so the model sees its own reasoning
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
        if response.stop_reason != "tool_use":
            # max_tokens hit or unexpected stop — return whatever text exists
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
        # Execute every tool call in this turn and collect results
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            handler = tool_handlers.get(block.name)
            if handler is None:
                result: Any = {
                    "error": f"Tool '{block.name}' is not registered.",
                    "available_tools": list(tool_handlers.keys()),
                }
            else:
                try:
                    result = handler(**block.input)
                except Exception as exc:
                    result = {"error": type(exc).__name__, "detail": str(exc)}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result) if not isinstance(result, str) else result,
            })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError(
        f"Agent did not complete within {max_iterations} iterations. "
        f"Last stop_reason: {response.stop_reason}"
    )

# Wire it together
def search_products(query: str, limit: int = 5, in_stock_only: bool = False) -> list[dict]:
    # Your actual database query here
    return [{"id": "p-42", "name": "Widget Pro", "price_usd": 29.99, "in_stock": True}]

answer = run_agent(
    system_prompt="You are a helpful product assistant. Use tools to answer questions accurately.",
    user_message="Do you have any widgets in stock under $50?",
    tools=[good_tool],
    tool_handlers={"search_product_catalog": search_products},
)
```

Note: the loop returns a `{"error": "...", "detail": "..."}` dict when tool handlers raise exceptions — never let raw Python tracebacks surface as tool result content. The model reads the result and can reason about errors: try an alternative approach, ask the user for clarification, or explain the limitation. A bare traceback confuses the model and wastes tokens on irrelevant stack frames.

Memory Architecture: Four Layers
Memory is the most underspecified component in most agent implementations. Developers wire up a messages list and call it done, then hit context limits at turn fifteen. Production agents need four distinct memory layers, each with different capacity, latency, and access semantics:
1. Working memory (in-context)
The current messages array passed to the model. Highest fidelity, zero latency, limited to the context window (100k–200k tokens depending on model). Exhausted fastest in tool-heavy agents because tool results inflate the conversation. Requires active management: trim old turns, summarize threads, or offload facts to external memory before the window fills.
2. External key-value memory
A fast store (Redis, DynamoDB) for structured agent state: user preferences, task checkpoints, session flags. Retrieved deterministically by key. Use this for anything the agent needs to remember across sessions that can be expressed as structured data. Inject relevant entries into the system prompt at session start, or expose them via a get_memory(key) tool call.
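A minimal sketch of what `get_memory` / `set_memory` tool handlers could look like; the in-process dict here stands in for a real store such as Redis or DynamoDB, and the handler names are illustrative:

```python
# Hypothetical key-value memory handlers; swap the dict for a Redis client in production.
_kv_store: dict[str, str] = {}

def get_memory(key: str) -> dict:
    """Deterministic lookup the model can invoke as a tool."""
    if key in _kv_store:
        return {"found": True, "value": _kv_store[key]}
    return {"found": False, "value": None}

def set_memory(key: str, value: str) -> dict:
    """Persist a structured fact under a stable key."""
    _kv_store[key] = value
    return {"ok": True}
```

Returning a `found` flag instead of raising on a missing key keeps the tool result structured, so the model can branch on absence rather than parse an error.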
3. Semantic / vector memory
Embeddings of past conversations, documents, or knowledge retrieved by similarity. Use Qdrant or pgvector for retrieval. Enables long-term recall without burning context tokens on irrelevant history. The agent retrieves the top-k most relevant chunks and injects them as context before the first LLM call.
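The retrieval step itself is simple once embeddings exist. This sketch uses plain cosine similarity over toy vectors; a real system would call an embedding model and a vector store like Qdrant or pgvector instead:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], corpus: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The returned chunks are what you inject as context before the first LLM call.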
4. Procedural / instruction memory
Stable behavioral rules, personas, and domain knowledge in the system prompt. These rarely change but have the highest impact on agent behavior. Store them in a versioned prompt registry rather than hardcoded strings — treat a system prompt change as a deployment: version it, run your eval suite, and roll back if quality degrades.
The most impactful detail is active working memory management. Here is a minimal class that handles trimming and persistent fact injection:
```python
from dataclasses import dataclass, field
from typing import Any, Optional
import json
from pathlib import Path

@dataclass
class AgentMemory:
    """Working memory manager with overflow protection and cross-session persistence."""

    messages: list[dict] = field(default_factory=list)
    # Short facts extracted from prior sessions, injected into the system prompt
    facts: dict[str, str] = field(default_factory=dict)
    # Structured state accessible via tool calls
    state: dict[str, Any] = field(default_factory=dict)
    _persist_path: Optional[Path] = None

    def inject_context(self, system_prompt: str) -> str:
        """Prepend remembered facts so the model has cross-session context."""
        if not self.facts:
            return system_prompt
        lines = [f"- {k}: {v}" for k, v in self.facts.items()]
        facts_block = "\n".join(lines)
        return f"{system_prompt}\n\nKnown context from prior sessions:\n{facts_block}"

    def trim_to_fit(self, max_pairs: int = 20) -> None:
        """Drop the oldest turns when history grows too long.

        Preserves the first message (often the original task) and removes
        whole user/assistant pairs to maintain conversation coherence.
        """
        if len(self.messages) <= max_pairs * 2:
            return
        preserved = self.messages[:1]
        # Even-length tail so it starts on a user turn, keeping pairs intact
        tail = self.messages[-(max_pairs * 2 - 2):]
        self.messages = preserved + tail

    def remember(self, key: str, value: str) -> None:
        """Persist a fact for injection in future sessions."""
        self.facts[key] = value
        self._save()

    def _save(self) -> None:
        if self._persist_path:
            data = {"facts": self.facts, "state": self.state}
            self._persist_path.write_text(json.dumps(data, indent=2))

    @classmethod
    def load(cls, path: Path) -> "AgentMemory":
        mem = cls(_persist_path=path)
        if path.exists():
            data = json.loads(path.read_text())
            mem.facts = data.get("facts", {})
            mem.state = data.get("state", {})
        return mem

# Usage: load at session start, pass enriched prompt to run_agent
memory = AgentMemory.load(Path("agent_state.json"))
enriched_system = memory.inject_context(base_system_prompt)
# ... run agent loop ...
memory.trim_to_fit(max_pairs=20)  # call before each LLM invocation
memory.remember("user_currency", "EUR")  # persist facts extracted during the session
```

Error Recovery: Retry, Validate, and Degrade Gracefully
Agents fail at three distinct layers: the LLM API (rate limits, timeouts, 5xx overload), tool execution (network failures, validation errors, business logic exceptions), and model output (hallucinated tool name, missing required parameter). Each layer needs a different strategy.
API-level failures are transient and retry-safe. The Anthropic API returns 429 for rate limits and 529 for overload. A jittered exponential backoff prevents thundering herd when multiple agent instances hit limits simultaneously:
```python
import time
import random
from anthropic import RateLimitError, APIStatusError, APIConnectionError
from typing import Any, Callable, TypeVar

T = TypeVar("T")

def call_with_retry(
    fn: Callable[..., T],
    *args: Any,
    max_retries: int = 4,
    base_delay: float = 1.0,
    **kwargs: Any,
) -> T:
    """Exponential backoff with jitter for transient API errors."""
    last_exc: Exception | None = None
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except RateLimitError as exc:
            last_exc = exc
            # Honor Retry-After header if the API provides it
            retry_after = float(getattr(exc, "retry_after", None) or 0)
            delay = max(retry_after, base_delay * (2 ** attempt)) + random.uniform(0, 0.5)
        except APIStatusError as exc:
            if exc.status_code not in {500, 529}:
                raise  # 4xx errors (except 429) are not retryable
            last_exc = exc
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.3)
        except APIConnectionError as exc:
            last_exc = exc
            delay = base_delay * (2 ** attempt)
        if attempt < max_retries:
            time.sleep(delay)
    raise RuntimeError(f"Failed after {max_retries} retries") from last_exc
```

Tool-level failures should surface to the model as structured error results. The model can then reason about the failure — request clarification, try an alternative tool, or explain the limitation. Never raise exceptions from tool handlers:
```python
from typing import Any, Callable

def safe_tool_call(handler: Callable, **kwargs: Any) -> dict:
    """Wrap any tool call to guarantee a structured result dict."""
    try:
        result = handler(**kwargs)
        return {"ok": True, "result": result}
    except ValueError as exc:
        return {"ok": False, "error": "invalid_input", "detail": str(exc)}
    except TimeoutError:
        return {
            "ok": False,
            "error": "timeout",
            "detail": "The service did not respond in time. Please try again.",
        }
    except PermissionError as exc:
        return {"ok": False, "error": "permission_denied", "detail": str(exc)}
    except Exception as exc:
        return {"ok": False, "error": "unexpected", "detail": f"{type(exc).__name__}: {exc}"}

# Graceful degradation: fall back to a simpler path if the primary tool fails
def search_with_fallback(query: str) -> dict:
    result = safe_tool_call(search_product_catalog, query=query)
    if result["ok"]:
        return result["result"]
    # Attempt a simpler keyword-only match before giving up
    fallback = safe_tool_call(search_product_catalog_basic, query=query)
    if fallback["ok"]:
        return {
            "results": fallback["result"],
            "note": "Using basic search — full-text search is temporarily unavailable.",
        }
    return {
        "error": "Search is temporarily unavailable.",
        "detail": result["detail"],
    }
```

Note: validate model-supplied parameters against a schema (for example with Pydantic) before executing the handler — a malformed call raises a ValidationError with a precise message you can return as a structured error. The model self-corrects on the next turn rather than propagating invalid data into your database or downstream API.

Observability: What to Instrument in Agent Systems
Standard APM metrics — request latency, error rate, throughput — are necessary but not sufficient for agents. Agents introduce new failure dimensions: iteration count, tool call distribution, and model reasoning quality. Without agent-specific telemetry, debugging a production failure means reading raw conversation logs by hand.
Instrument at least four levels:
Agent-level span
One trace per agent invocation. Record: agent name, total iterations, total input/output tokens, total wall-clock duration, and final outcome (success / max_iterations_exceeded / error). This is the rollup metric your SLOs are built on.
LLM call span
One child span per messages.create() call. Record: model name, input/output tokens, stop_reason, latency, number of tools available. This surfaces which iterations are expensive and which stop reasons correlate with downstream failures.
Tool call span
One child span per tool execution. Record: tool name, sanitized input parameters, output size in bytes, latency, and success/failure. This directly shows which tools are called most, which are slowest, and which fail most often — the primary input to reliability engineering prioritization.
Conversation snapshot
Store full conversation logs (PII masked) in a queryable store, each entry linked to the trace ID. When a user reports a wrong answer, you need the full message history, not just the metrics. Log one JSON object per message and index by agent name and date.
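The snapshot store can be as simple as append-only JSON lines keyed by trace ID. This sketch assumes a `mask` callable for PII redaction; all names here are illustrative:

```python
import json
from pathlib import Path

def log_snapshot(path: Path, trace_id: str, agent_name: str, messages: list[dict],
                 mask=lambda text: text) -> None:
    """Append one masked conversation snapshot as a JSON line, linked to its trace."""
    entry = {
        "trace_id": trace_id,
        "agent": agent_name,
        "messages": [
            {"role": m["role"], "content": mask(str(m["content"]))} for m in messages
        ],
    }
    with path.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
```

One line per snapshot keeps the file greppable by trace ID and trivially loadable into a queryable store later.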
```python
import time
from typing import Any, Callable
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def run_agent_instrumented(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict,
    max_iterations: int = 10,
    agent_name: str = "default",
) -> str:
    with tracer.start_as_current_span(f"agent.{agent_name}") as span:
        span.set_attribute("agent.name", agent_name)
        span.set_attribute("agent.max_iterations", max_iterations)
        span.set_attribute("agent.tools_count", len(tools))
        try:
            result = run_agent(
                system_prompt=system_prompt,
                user_message=user_message,
                tools=tools,
                tool_handlers=tool_handlers,
                max_iterations=max_iterations,
            )
            span.set_attribute("agent.outcome", "success")
            return result
        except RuntimeError as exc:
            span.set_attribute("agent.outcome", "max_iterations_exceeded")
            span.record_exception(exc)
            raise
        except Exception as exc:
            span.set_attribute("agent.outcome", "error")
            span.record_exception(exc)
            raise

def instrument_tool_handler(name: str, handler: Callable) -> Callable:
    """Wrap a tool handler to emit a child span with latency and outcome."""
    def wrapped(**kwargs: Any) -> Any:
        with tracer.start_as_current_span(f"tool.{name}") as span:
            span.set_attribute("tool.name", name)
            start = time.perf_counter()
            try:
                result = handler(**kwargs)
                span.set_attribute("tool.outcome", "success")
                return result
            except Exception as exc:
                span.set_attribute("tool.outcome", "error")
                span.record_exception(exc)
                raise
            finally:
                span.set_attribute("tool.duration_ms", (time.perf_counter() - start) * 1000)
    return wrapped
```

Production Checklist
Cap iterations and raise a distinct error type
Every agent loop needs a hard iteration cap (default: 10). When the cap is hit, raise a named exception (AgentMaxIterationsError) rather than returning a partial result. Log the last three messages at WARNING level for debugging. Alert if more than 5% of agent runs reach the cap — it indicates a tool description or task decomposition problem, not a transient failure.
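One way to encode this as a named exception, carrying the three-message tail described above for WARNING-level logging; the class name `AgentMaxIterationsError` follows the convention suggested here, not a library API:

```python
class AgentMaxIterationsError(RuntimeError):
    """Raised when an agent loop hits its hard iteration cap."""

    def __init__(self, max_iterations: int, messages: list[dict]):
        super().__init__(f"Agent exceeded {max_iterations} iterations")
        self.max_iterations = max_iterations
        # Keep only the conversation tail for logging, not the full history
        self.last_messages = messages[-3:]
```

Raising a distinct type (instead of a bare `RuntimeError`) lets callers and alerting rules distinguish cap hits from genuine errors.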
Version your system prompts like code
Store system prompts in a version-controlled YAML file or prompt registry. Never edit prompts directly in production. Run your eval suite before any prompt change reaches live traffic, and log the prompt hash in every trace. When behavior regresses, you need to know whether the change came from a model update or a prompt edit — without version tracking you cannot tell.
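A short fingerprint is enough to correlate traces with prompt versions; a minimal sketch:

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of a system prompt to attach to every trace."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
```

Log this value as a span attribute alongside the model name, and a behavior regression can be attributed to a prompt edit or a model update in one query.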
Sanitize PII from tool results before logging
Tool results appear in conversation logs and traces. If a tool returns PII — email addresses, SSNs, payment data — apply a masking function before logging. The model receives the full result for its reasoning; your telemetry system stores only the masked version. Use Microsoft Presidio for automated PII detection and redaction.
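As a cheap first pass before wiring in Presidio, a regex mask for logs might look like this (the patterns are illustrative, not exhaustive):

```python
import re

# Illustrative patterns only; real PII detection needs a dedicated tool
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace obvious email addresses and SSNs before text reaches logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)
```

Apply this only on the telemetry path; the model still receives the unmasked tool result it needs for reasoning.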
Run an eval suite before every model or prompt change
Build a golden dataset of 50–100 representative agent interactions with known correct outputs or labeled behavior. Run this suite automatically when you change models, system prompts, or tool descriptions. Track task completion rate, tool selection accuracy, and answer quality. A model upgrade that improves general reasoning can still degrade specific tool-calling patterns — evals catch this before it reaches production.
Gate irreversible actions behind a human-in-the-loop step
For agents that can send emails, charge customers, or delete data, add a confirmation gate: the agent produces a proposed_action output, your application layer presents it for approval (via UI or Slack webhook), and execution resumes only after explicit sign-off. The Anthropic tool use documentation covers patterns for pausing and resuming agent execution across HTTP requests.
Set per-session token budgets and alert on spend anomalies
A single misconfigured agent in an infinite retry loop can consume hundreds of dollars in minutes. Count input + output tokens across all LLM calls in a session and abort with a budget-exceeded error when the limit is hit. Alert on P99 token spend per agent type so you catch cost anomalies before they compound. Typical budget: 50k tokens per session for a task-focused agent, 200k for a research agent with many tool calls.
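A per-session budget can be a small counter charged after every LLM call; a sketch under the 50k-token default mentioned above:

```python
class TokenBudget:
    """Abort a session when cumulative token spend crosses the cap."""

    def __init__(self, limit: int = 50_000):
        self.limit = limit
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one LLM call's usage; raise once the budget is exhausted."""
        self.spent += input_tokens + output_tokens
        if self.spent > self.limit:
            raise RuntimeError(
                f"Token budget exceeded: {self.spent} > {self.limit}"
            )
```

Call `charge` with the usage numbers from each API response inside the agent loop, so a runaway session aborts on the very call that crosses the cap.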
Building AI agents for your product or internal tooling?
We design and implement production-grade AI agent systems — from tool orchestration and memory architecture to error recovery, observability, and evaluation pipelines. Let’s talk.