Deep Dive: The Single ReAct Agent — Architecture, Best Practices, and Pitfalls

The single ReAct agent is the most fundamental — and most widely deployed — agent architecture in production today. It places one LLM in a reasoning loop with access to tools, iterating through Thought → Action → Observation cycles until it reaches a goal or hits a limit. GitHub Copilot’s agent mode, ChatGPT’s tool-use flows, and thousands of enterprise assistants are all built on this pattern.

This post is a deep dive into the single ReAct agent: what it is, how it works internally, what it’s used for, its strengths and weaknesses, the best practices that make it reliable, the common mistakes that make it fail, and the real-world systems that prove it works.


1. What Is a Single ReAct Agent?

A single ReAct agent is an AI system where one LLM operates in an iterative loop — reasoning about a goal, selecting and executing tool calls, observing the results, and repeating until the task is complete. The name comes from the ReAct pattern (Reason + Act), introduced by Yao et al. at ICLR 2023, which formalized the interleaving of chain-of-thought reasoning with tool execution.

Unlike a simple prompt → response exchange (stateless, single-turn), or a multi-agent system (multiple LLMs coordinating), the single ReAct agent is:

  • One LLM — a single reasoning engine making all decisions.
  • Tool-augmented — it can call external tools (APIs, databases, search engines, code interpreters) to interact with the world.
  • Iterative — it loops through reasoning and action steps, accumulating context, until it can produce a final answer.
  • Autonomous within bounds — it decides which tools to call, in what order, and when to stop — but operates within guardrails set by the developer.

“ReAct prompts LLMs to generate both verbal reasoning traces and text actions in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting, while also interacting with external environments to incorporate additional information into reasoning.” — Yao, S. et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023

The Core Loop

Every ReAct agent — regardless of framework — follows the same cycle:

  1. Thought — The LLM reasons about the current state: what it knows, what it still needs, and what to do next.
  2. Action — The LLM selects a tool and generates the arguments (e.g., search_web(query="ACME Q4 earnings")).
  3. Observation — The runtime executes the tool call and returns the result to the LLM as new context.
  4. Repeat — The LLM reads the observation, generates the next thought, and either takes another action or produces a final answer.
Detailed ReAct loop diagram showing the step-by-step flow: Thought, Action, Tool Execution, Observation, and the feedback loop back to the next Thought, ending with a Final Answer.
Figure: The ReAct loop in detail — the LLM generates Thoughts and Actions, the runtime executes tools and returns Observations, and the cycle repeats until a final answer is produced.
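
The cycle above can be sketched in plain Java, independent of any framework. This is a minimal illustration under stated assumptions, not how any particular library implements the loop: `ReactLoopSketch`, `Decision`, and the `model` function (a stand-in for the LLM call) are hypothetical names introduced here, and tools are reduced to string-to-string functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class ReactLoopSketch {

    // What the model returns at each step: a thought plus either a tool call or a final answer.
    record Decision(String thought, String tool, String argument, String finalAnswer) {
        boolean isFinal() { return finalAnswer != null; }
    }

    // Runs Thought -> Action -> Observation until a final answer or the step cap is reached.
    static String run(String goal,
                      Function<List<String>, Decision> model,       // hypothetical stand-in for the LLM
                      Map<String, Function<String, String>> tools,  // tool name -> implementation
                      int maxSteps) {
        List<String> context = new ArrayList<>();
        context.add("User: " + goal);
        for (int step = 1; step <= maxSteps; step++) {
            Decision d = model.apply(List.copyOf(context));   // Thought (+ Action or Final Answer)
            context.add("Thought: " + d.thought());
            if (d.isFinal()) {
                return d.finalAnswer();
            }
            Function<String, String> tool = tools.get(d.tool());
            String observation = (tool == null)
                    ? "Error: unknown tool '" + d.tool() + "'. Available: " + tools.keySet()
                    : tool.apply(d.argument());               // the runtime executes the tool
            context.add("Action: " + d.tool() + "(" + d.argument() + ")");
            context.add("Observation: " + observation);       // fed back on the next iteration
        }
        return "Stopped after " + maxSteps + " steps without a final answer.";
    }
}
```

For a quick check, `model` can be a scripted function that returns a tool call on the first step and a final answer once an observation is present. The step cap is what keeps a model that never finishes from looping forever.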

How It Differs from Other Patterns

Pattern            | Reasoning   | Tool Calls       | Iterations | LLMs
Prompt / RAG       | Single-pass | None             | 1          | 1
Single tool call   | Single-pass | 1                | 1          | 1
Single ReAct agent | Multi-step  | Multiple, looped | Many       | 1
Multi-agent system | Multi-step  | Multiple, looped | Many       | Multiple

The single ReAct agent occupies the sweet spot: complex enough to handle multi-step tasks, simple enough to debug, deploy, and reason about.


2. Internal Architecture

Understanding what’s inside a single ReAct agent clarifies how the loop works and where things can go wrong.

Architecture diagram showing the internal components of a single ReAct agent: system prompt, LLM reasoning engine, context window with accumulated thoughts and observations, and the tool belt with external tools.
Figure: Inside a single ReAct agent — the system prompt configures behaviour, the context window accumulates the full trajectory, the LLM reasons and selects actions, and tools execute in the runtime.

2.1 System Prompt

The system prompt defines the agent’s identity, capabilities, constraints, and output format. It includes:

  • Role and goal — “You are a customer service agent for ACME Corp.”
  • Tool descriptions — Name, purpose, parameters, and when to use each tool.
  • Constraints — “Never fabricate data. If you can’t find the answer, say so.”
  • Output format — “Always respond with a JSON object containing thought, action, and arguments.”

The quality of the system prompt is the single most underrated lever on agent reliability. Anthropic’s building effective agents guide emphasises: “the system prompt is your agent’s constitution — invest in it.”

2.2 Context Window

The context window is the LLM’s working memory. At each iteration, it contains:

[System Prompt]
[User Message]
[Thought₁] → [Action₁] → [Observation₁]
[Thought₂] → [Action₂] → [Observation₂]
...
[Thoughtₙ] → [Final Answer]

Every step appends to this context. This is both the mechanism that gives the agent memory within a run and the constraint that limits how long a run can be — once the context window fills, the agent can no longer reason effectively. The “Lost in the Middle” paper (Liu et al., 2024) showed that LLMs attend best to the beginning and end of the context, meaning early observations can get “buried” as the window grows.

2.3 LLM (Reasoning Engine)

The LLM reads the full context at each step and produces either:

  • A Thought + Action — reasoning plus a tool call to execute, or
  • A Final Answer — the completed response to the user.

The quality of the agent is bounded by the quality of its LLM. Stronger models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) handle complex multi-step planning significantly better than smaller ones. The AgentBench evaluation (Liu et al., 2024) demonstrated a stark capability gap: GPT-4-class models scored 4–5× higher than open-source alternatives on agent tasks.

2.4 Tool Belt

Tools are the agent’s hands. They bridge the gap between reasoning (what the LLM does well) and acting (what requires external systems). Common tool categories:

Category              | Examples
Information retrieval | Web search, database queries, file reads
Computation           | Calculator, code interpreter, data analysis
Actions               | Send email, create ticket, deploy code
Verification          | Fact-checking, schema validation, test runner

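The Toolformer paper (Schick et al., 2023) demonstrated that LLMs can learn when and how to call external tools. In a prompted agent, a tool’s name and description are the main signals the model has for choosing between tools, which makes tool design one of the highest-leverage engineering tasks in agent development.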


3. Execution Trace: Seeing the Loop in Action

Abstract descriptions only go so far. Here is a concrete execution trace of a ReAct agent answering a factual comparison question:

Visual execution trace showing a ReAct agent completing a task in 3 iterations with 2 tool calls, including the Thought, Action, and Observation at each step, plus run metrics.
Figure: A complete ReAct execution trace — 3 iterations, 2 tool calls, final answer with run metrics. Each step is logged for observability.

Key observations from this trace:

  • The agent decided when to stop — After iteration 3, it had enough information and produced the final answer without another tool call.
  • Each iteration built on the previous — The Thought in iteration 2 explicitly referenced the result from iteration 1.
  • The cost was modest — 1,240 tokens and $0.006 for a complete, sourced answer. This is why single-agent ReAct is cost-effective for most tasks.

4. What Is It Used For?

The single ReAct agent is the right architecture for the majority of tool-using AI tasks. It excels wherever a task requires multiple steps, dynamic decision-making, and interaction with external systems — but doesn’t require the overhead of multiple coordinating agents.

4.1 Conversational Assistants with Tool Access

The most common use case: a chatbot that can look things up, perform calculations, and take actions on behalf of the user.

  • Customer service — Klarna’s AI assistant uses a single-agent loop to look up orders, process refunds, and answer billing questions — handling two-thirds of all customer service conversations.
  • Internal helpdesks — IT support agents that search knowledge bases, check system status, and create tickets.
  • Personal assistants — ChatGPT Plus uses ReAct-style tool calling to search the web, run code, and generate images within a single conversation.

4.2 Coding Assistants

Agents that read code, write changes, run tests, and iterate on errors.

  • GitHub Copilot Agent Mode — A single ReAct agent inside the IDE with tools for file I/O, terminal commands, code search, and error checking. It reads relevant files, plans changes, edits code, runs tests, and iterates until the task is complete.
  • Cursor — AI-powered IDE that uses a ReAct-style agent to understand codebases, propose edits, and apply them across files.

4.3 Data Analysis and Research

Agents that query data, perform calculations, and produce insights.

  • Code Interpreter (ChatGPT) — A ReAct agent with a Python sandbox. It writes and executes code to analyze uploaded data, generate charts, and answer analytical questions.
  • Julius AI — A data analysis agent that connects to data sources, writes SQL/Python, visualizes results, and iterates based on user feedback.

4.4 Search and Summarization

Agents that find, synthesize, and present information from multiple sources.

  • Perplexity AI — A search agent that retrieves, cross-references, and summarizes web sources with citations.
  • GPT Researcher — An open-source single agent that autonomously searches the web, gathers sources, and writes comprehensive research reports.

4.5 Workflow Automation

Agents that orchestrate multi-step business processes.

  • Email triage — Read emails, classify urgency, draft responses, create follow-up tasks.
  • Document processing — Extract data from invoices, validate against a database, flag discrepancies, generate a summary.
  • CI/CD helpers — Read build logs, diagnose failures, suggest fixes, open pull requests.

5. Pros and Cons

Pros

  • Simple to build and debug — One LLM, one loop, one context. The entire agent trajectory is a linear sequence of Thought → Action → Observation steps that can be read like a log. Compared to multi-agent systems, there are no inter-agent communication protocols, handoff bugs, or distributed state to manage.

  • Cost-effective — A typical single-agent task completes in 3–10 iterations, consuming 1,000–5,000 tokens. At GPT-4o pricing (~$2.50 per million input tokens, ~$10 per million output tokens), most tasks cost $0.005–$0.05. The SWE-bench analysis shows that even complex coding tasks average 20–50 LLM calls — still within a single-agent budget.

  • Adaptive and self-correcting — When a tool call fails or returns unexpected results, the agent can re-reason and try an alternative approach. The Reflexion framework (Shinn et al., 2023) demonstrated that agents with self-reflection improve success rates by 20–30% on coding and reasoning benchmarks.

  • Framework support is mature — Every major agent framework supports single ReAct agents: Spring AI, LangGraph, OpenAI Agents SDK, Google ADK, Semantic Kernel, and Haystack all provide tool-calling agent primitives out of the box.

  • Easy to evaluate — The linear trajectory makes it straightforward to build evaluation datasets of (task, expected trajectory, expected output) triples and score agents on correctness, efficiency, and safety.

  • Interacts with the real world — Unlike prompt-only or RAG-only systems, a ReAct agent can take actions: send emails, write files, query APIs, execute code. This makes it practical for real automation, not just information synthesis.
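
The cost figures in the "Cost-effective" point can be reproduced with a few lines of arithmetic. The sketch below uses the approximate GPT-4o prices quoted above; `RunCostEstimate` and its method names are hypothetical, and the second method assumes a stateless chat API without prompt caching, where every iteration re-sends the context accumulated so far.

```java
final class RunCostEstimate {

    // USD per million tokens, matching the approximate GPT-4o figures quoted in the text.
    static final double INPUT_USD_PER_MILLION = 2.50;
    static final double OUTPUT_USD_PER_MILLION = 10.00;

    // Cost of a run given the total billed input and output tokens.
    static double costUsd(long inputTokens, long outputTokens) {
        return inputTokens * INPUT_USD_PER_MILLION / 1_000_000.0
             + outputTokens * OUTPUT_USD_PER_MILLION / 1_000_000.0;
    }

    // Assumption: iteration i re-sends the prompt plus the i steps appended so far,
    // so billed input grows faster than the final context length.
    static long billedInputTokens(int promptTokens, int tokensPerStep, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += promptTokens + (long) i * tokensPerStep;
        }
        return total;
    }
}
```

With 4,000 input tokens and 1,000 output tokens, `costUsd` gives 4,000 × $2.50/1M + 1,000 × $10/1M = $0.01 + $0.01 = $0.02, inside the range quoted above. Because input is re-billed on every iteration, long runs cost more than their final context size alone would suggest.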

Cons

  • Single point of failure — One LLM makes all decisions. If the model hallucinates, picks the wrong tool, or misinterprets an observation, there is no second agent to catch the error. Multi-agent systems can implement reviewer or verifier roles for redundancy.

  • Context window is a hard ceiling — Every iteration appends ~200–500 tokens (thought + action + observation) to the context. A 128k-token model can sustain roughly 50–100 iterations before performance degrades. For very long tasks, this is a real limitation. The MemGPT paper demonstrated that without active context management, agent performance drops sharply as the window fills.

  • Sequential latency — Each iteration requires a full LLM inference call (1–3 seconds) plus tool execution time. An agent making 10 iterations adds 15–30 seconds of latency. For interactive use cases where users expect sub-second responses, this can be unacceptable.

  • Limited task breadth — A single agent must hold all domain knowledge, tool expertise, and reasoning strategies in one system prompt and context window. For tasks spanning multiple domains (e.g., research + code + design), the agent’s prompt becomes overloaded and tool selection degrades. This is precisely where multi-agent architectures add value.

  • Unpredictable behaviour — The LLM may choose unexpected tools, generate malformed arguments, loop on failing calls, or take unintended actions. Debugging requires full trajectory tracing, and the same input may produce different trajectories on different runs due to model non-determinism.

  • Security risk amplification — Each tool call is a potential attack surface. A prompt injection embedded in a tool’s output can hijack subsequent reasoning. OWASP’s Top 10 for LLM Applications lists “Excessive Agency” as a critical risk, and the single-agent pattern concentrates all agency in one decision-maker.


6. When to Use a Single ReAct Agent (and When Not To)

The single ReAct agent is not always the right choice. Use the following decision flow:

Decision flow diagram with three questions — Does the task need tools? Does it require multi-step reasoning? Can one agent cover all needs? — leading to Prompt/RAG, Single Tool Call, Single ReAct Agent, or Multi-Agent as outcomes.
Figure: Choose a single ReAct agent when the task needs tools AND multi-step reasoning AND can be handled by one agent. Otherwise, use a simpler or more complex architecture.
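
The decision flow can also be written as a small function. This is one reading of the three questions in the figure, offered as a sketch; `ArchitectureChooser` and its enum are hypothetical names, and real decisions will weigh more factors than three booleans.

```java
final class ArchitectureChooser {

    enum Architecture { PROMPT_OR_RAG, SINGLE_TOOL_CALL, SINGLE_REACT_AGENT, MULTI_AGENT }

    // Asks the three questions in order and returns the simplest architecture that fits.
    static Architecture choose(boolean needsTools,
                               boolean needsMultiStepReasoning,
                               boolean oneAgentCoversAllNeeds) {
        if (!needsTools) {
            return Architecture.PROMPT_OR_RAG;          // no external interaction required
        }
        if (!needsMultiStepReasoning) {
            return Architecture.SINGLE_TOOL_CALL;       // one call, no loop
        }
        return oneAgentCoversAllNeeds
                ? Architecture.SINGLE_REACT_AGENT
                : Architecture.MULTI_AGENT;
    }
}
```

For example, a customer-service task that needs order lookups and several reasoning steps within one domain maps to `choose(true, true, true)`, which returns `SINGLE_REACT_AGENT`.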

Use a Single ReAct Agent When:

  • The task requires 2–15 tool calls across 1–3 tool types.
  • A single system prompt can capture all necessary instructions and constraints.
  • The total trajectory fits comfortably within the model’s context window.
  • Latency of 5–30 seconds is acceptable.
  • The task is within a single domain (customer service, coding, data analysis).

Don’t Use a Single ReAct Agent When:

  • A single LLM call or RAG query can solve the task — use the simpler approach.
  • The task spans multiple domains that require distinct expertise — use multi-agent collaboration.
  • You need redundancy and verification (e.g., a writer + reviewer) — use a hierarchical or multi-agent system.
  • The task requires 100+ iterations or produces very large intermediate outputs — the context window will overflow.

“The best agent is the simplest one that solves your problem.” — Harrison Chase, CEO of LangChain, LangChain Blog, 2024


7. Best Practices

Building a ReAct agent that works in a demo is easy. Building one that works reliably in production requires disciplined engineering. These best practices are drawn from Anthropic’s building effective agents guide, OpenAI’s practical guide to building agents, and hard-won production experience.

Visual summary of eight best practices for single ReAct agents: tool design, budgets, structured outputs, context management, observability, human gates, sandboxing, and graceful failure.
Figure: The eight practices that separate reliable production agents from fragile prototypes.

7.1 Design Tools for the LLM, Not for Humans

The LLM selects tools based on their names and descriptions. Treat tool definitions as API documentation written for a junior developer:

  • Clear, descriptive names — get_customer_order_history not query_db.
  • Narrow scope — One tool, one job. Avoid Swiss-army-knife tools like run_sql(query).
  • Explicit schemas — Use JSON schemas with required fields, enums, and descriptions for every parameter.
  • Actionable error messages — “Order ORD-789 not found. Verify the order ID and try again.” not “500 Internal Server Error.”
{
  "name": "get_order_status",
  "description": "Look up the current status of a customer order by order ID. Returns the order status, shipping carrier, and tracking number if available. Use this when a customer asks about a specific order.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "The order identifier (e.g., 'ORD-12345')"
      }
    },
    "required": ["order_id"]
  }
}

OpenAI’s function calling best practices emphasise: “The description field is the most important part of a function definition.” Because the model sees only each tool’s name, description, and parameter schema, those fields are the primary signal it has for tool selection.
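
On the implementation side, the same tool might look like the sketch below: one narrow job, an input check, and error text the model can act on. Everything here is illustrative. `OrderStatusTool` is a hypothetical class, the in-memory map stands in for a real order service, and the `ORD-` plus digits format is assumed from the example ID in the schema.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class OrderStatusTool {

    // Assumed ID format, inferred from the 'ORD-12345' example in the schema.
    private static final Pattern ORDER_ID = Pattern.compile("ORD-\\d+");

    // Hypothetical in-memory data standing in for a real order service.
    private final Map<String, String> statusByOrderId;

    OrderStatusTool(Map<String, String> statusByOrderId) {
        this.statusByOrderId = Map.copyOf(statusByOrderId);
    }

    // Returns text the model can act on: the status, or an error that says what to fix.
    String getOrderStatus(String orderId) {
        if (orderId == null || !ORDER_ID.matcher(orderId).matches()) {
            return "Invalid order_id '" + orderId + "'. Expected the form 'ORD-12345'. "
                 + "Ask the customer for their order ID and try again.";
        }
        String status = statusByOrderId.get(orderId);
        if (status == null) {
            return "Order " + orderId + " not found. Verify the order ID and try again.";
        }
        return "Order " + orderId + " status: " + status;
    }
}
```

Returning the error as an observation, instead of throwing, gives the model a chance to correct its own call on the next iteration.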

7.2 Set Budgets and Guardrails

Without explicit limits, a confused agent will loop indefinitely, consuming unbounded tokens and time.

  • Max iterations — Cap at 15–25 steps for most tasks. Harrison Chase, in his 2024 talk on agent reliability, noted that “the most common production agent bug is an infinite loop.”
  • Max tokens per run — Set a hard token budget (e.g., 50,000 tokens) to prevent cost explosions.
  • Wall-clock timeout — Kill runs exceeding a reasonable duration (e.g., 60–120 seconds).
  • Per-tool retry limit — If a tool fails 2–3 times, stop retrying and escalate.
  • Fallback — On limit hit, return a partial result, not silence.
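import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class BudgetedAgent {

    private final ChatClient chatClient;

    public BudgetedAgent(ChatClient.Builder builder) {
        this.chatClient = builder
            .defaultSystem("You are a helpful research assistant.")
            .defaultTools(new SearchTool(), new CalculatorTool())
            .build();
    }

    // Application-level wall-clock timeout (virtual threads require Java 21).
    // Iteration cap: ChatClient has no built-in tool-call limit that we are aware of
    // (Spring AI 1.0). To cap iterations, disable internal tool execution and drive
    // the loop yourself with ToolCallingManager, counting each round.
    public String runWithBudget(String message, Duration maxDuration) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            var future = executor.submit(() ->
                chatClient.prompt()
                    .user(message)
                    .call()
                    .content()
            );
            try {
                // Milliseconds, so sub-second budgets are not truncated to zero
                return future.get(maxDuration.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);  // Interrupt the in-flight call
                return "Agent timed out after %ds. Escalating to human."
                    .formatted(maxDuration.toSeconds());
            }
        } catch (Exception e) {
            return "Agent failed: " + e.getMessage();
        }
    }
}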
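
The remaining limits from the list above (iterations, tokens, per-tool retries) can be tracked in a small framework-independent object that the agent loop consults before each step. This is a sketch under assumptions: `RunBudget` is a hypothetical class, and the caller is expected to report token usage and tool failures itself.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class RunBudget {
    private final int maxIterations;
    private final long maxTokens;
    private final int maxFailuresPerTool;
    private final Instant deadline;
    private final Map<String, Integer> failuresByTool = new HashMap<>();
    private int iterations;
    private long tokens;

    RunBudget(int maxIterations, long maxTokens, int maxFailuresPerTool, Duration maxDuration) {
        this.maxIterations = maxIterations;
        this.maxTokens = maxTokens;
        this.maxFailuresPerTool = maxFailuresPerTool;
        this.deadline = Instant.now().plus(maxDuration);
    }

    // Call once per loop iteration with the tokens that iteration consumed.
    void recordIteration(long tokensUsed) {
        iterations++;
        tokens += tokensUsed;
    }

    void recordToolFailure(String toolName) {
        failuresByTool.merge(toolName, 1, Integer::sum);
    }

    // Empty while the run may continue; otherwise the reason to stop and fall back.
    Optional<String> exceeded() {
        if (iterations >= maxIterations) {
            return Optional.of("iteration limit of " + maxIterations + " reached");
        }
        if (tokens >= maxTokens) {
            return Optional.of("token budget of " + maxTokens + " exhausted");
        }
        if (Instant.now().isAfter(deadline)) {
            return Optional.of("wall-clock timeout reached");
        }
        for (Map.Entry<String, Integer> e : failuresByTool.entrySet()) {
            if (e.getValue() >= maxFailuresPerTool) {
                return Optional.of("tool '" + e.getKey() + "' failed " + e.getValue() + " times");
            }
        }
        return Optional.empty();
    }
}
```

A loop would check `budget.exceeded()` at the top of each iteration and, when a reason is present, return the partial result along with that reason instead of silently stopping.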

7.3 Use Structured Outputs

Force the LLM to produce tool calls as validated JSON. Freeform text parsing with regex is the leading cause of tool-call failures in production.

  • OpenAI Structured Outputs — Guarantees JSON conformance via constrained decoding (server-side — works from any language).
  • Spring AI Structured Output — Maps LLM responses directly to Java records/classes using Jackson schema generation with automatic retries.
  • Outlines — Grammar-constrained generation for open-source models.

Jason Liu (creator of Instructor) reports that structured output enforcement reduces agent tool-call failures by 30–50% compared to freeform text parsing.

import java.util.Map;

import com.fasterxml.jackson.annotation.JsonPropertyDescription;
import org.springframework.ai.chat.client.ChatClient;

public enum ToolChoice {
    SEARCH_WEB, GET_ORDER_STATUS, CALCULATE
}

public record AgentAction(
    @JsonPropertyDescription("Chain-of-thought reasoning for this step")
    String reasoning,
    @JsonPropertyDescription("Which tool to call next")
    ToolChoice tool,
    @JsonPropertyDescription("Arguments to pass to the tool")
    Map<String, Object> arguments
) {}

// Spring AI structured output: entity() requests JSON matching AgentAction
// and converts the reply. chatClientBuilder is an injected ChatClient.Builder.
ChatClient chatClient = chatClientBuilder.build();

AgentAction action = chatClient.prompt()
    .user("What is the status of order ORD-12345?")
    .call()
    .entity(AgentAction.class);
// An unknown tool name fails during conversion instead of reaching your
// dispatch code, so no hand-written parsing is needed.

7.4 Manage the Context Window

Every iteration appends Thought + Action + Observation to the context. Without management, the window fills in 10–20 steps for tasks with large tool outputs.

  • Summarize tool outputs — If a tool returns 5,000 tokens, extract the relevant 200 tokens before appending.
  • Truncate conversation history — Keep recent turns in full, summarize older ones.
  • Reserve capacity — Keep at least 20–30% of the context window free for the model’s reasoning and output.
  • Use retrieval — For information the agent might need later, store externally and retrieve on demand rather than keeping it in context.

The MemGPT paper (Packer et al., 2023) formalized this as “virtual context management” for LLMs: paging information in and out of context as needed, by analogy with virtual memory in operating systems.
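
Two of the tactics above, capping tool outputs and compacting older history, can be sketched without any framework. The names here (`ContextTrimmer`, `clip`, `compact`) are hypothetical, lengths are measured in characters as a rough proxy for tokens, and the one-line placeholder stands in for what would normally be a generated summary.

```java
import java.util.ArrayList;
import java.util.List;

final class ContextTrimmer {

    // Caps a single tool output before it is appended to the context.
    static String clip(String observation, int maxChars) {
        if (observation.length() <= maxChars) {
            return observation;
        }
        int dropped = observation.length() - maxChars;
        return observation.substring(0, maxChars) + " ...[" + dropped + " chars omitted]";
    }

    // Keeps the last `keepRecent` steps verbatim and replaces older ones with a placeholder.
    static List<String> compact(List<String> steps, int keepRecent) {
        if (steps.size() <= keepRecent) {
            return List.copyOf(steps);
        }
        int older = steps.size() - keepRecent;
        List<String> result = new ArrayList<>();
        result.add("[Summary of " + older + " earlier steps omitted; retrieve details on demand]");
        result.addAll(steps.subList(older, steps.size()));
        return List.copyOf(result);
    }
}
```

A production agent would count tokens with the model's tokenizer and store the omitted material externally so it can be retrieved later, as the last bullet suggests.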

7.5 Invest in Observability

You cannot improve what you cannot see. Log every step with full fidelity:

What to Log                 | Why
Thought / reasoning trace   | Understand why the agent chose an action
Tool call (name + args)     | Audit what the agent did
Tool output (full response) | Trace data flow and identify bad inputs
Token count per step        | Cost attribution and budget monitoring
Latency per step            | Identify bottlenecks
Final outcome               | Measure overall reliability

Use a dedicated observability platform: LangSmith, Arize Phoenix, or Braintrust. Chip Huyen, in AI Engineering (O’Reilly, 2025), argues: “Without tracing, you are debugging LLM applications with a blindfold on.”
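
Before adopting a platform, the fields in the table can be captured with a plain in-process log. The sketch below is illustrative only: `TrajectoryLog` and `Step` are hypothetical names, and the key=value line format is an arbitrary choice, not the format of any tracing product.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class TrajectoryLog {

    // One record per loop iteration, mirroring the "What to Log" table.
    record Step(int index, String thought, String tool, String arguments,
                String observation, int tokens, Duration latency) {}

    private final List<Step> steps = new ArrayList<>();

    void record(String thought, String tool, String arguments,
                String observation, int tokens, Duration latency) {
        steps.add(new Step(steps.size() + 1, thought, tool, arguments,
                           observation, tokens, latency));
    }

    int totalTokens() {
        return steps.stream().mapToInt(Step::tokens).sum();
    }

    Duration totalLatency() {
        return steps.stream().map(Step::latency).reduce(Duration.ZERO, Duration::plus);
    }

    // One line per step, in execution order, suitable for a plain-text log.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Step s : steps) {
            sb.append("step=").append(s.index())
              .append(" tool=").append(s.tool())
              .append(" args=").append(s.arguments())
              .append(" tokens=").append(s.tokens())
              .append(" latency_ms=").append(s.latency().toMillis())
              .append(" thought=\"").append(s.thought()).append('"')
              .append(" observation=\"").append(s.observation()).append('"')
              .append('\n');
        }
        return sb.toString();
    }
}
```

Per-step token and latency totals make it possible to attribute cost and spot the slow tool, which a log of final answers alone cannot do.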

7.6 Gate High-Stakes Actions

Classify every tool by risk tier and require human approval for irreversible operations:

Tier             | Risk   | Examples                        | Approval
Read             | Low    | Search, fetch data, read file   | Auto-execute
Reversible write | Medium | Create draft, stage changes     | Auto-execute + log
Irreversible     | High   | Send email, delete, deploy, pay | Human approval

The OWASP Top 10 for LLM Applications identifies “Excessive Agency” as a critical risk. The Air Canada chatbot case, decided in 2024, shows the real-world consequences: the chatbot gave a customer incorrect information about the airline’s bereavement-fare refund policy, and a tribunal held the company liable for it.
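
The tier table translates directly into a gate that sits between the model's chosen action and the tool runtime. The following is a sketch with hypothetical names (`ActionGate`, `RiskTier`, `humanApproves`); the approval predicate stands in for whatever review queue or confirmation dialog an application uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class ActionGate {

    enum RiskTier { READ, REVERSIBLE_WRITE, IRREVERSIBLE }

    private final Map<String, RiskTier> tierByTool;
    private final Predicate<String> humanApproves;   // hypothetical hook into a review step
    private final List<String> auditLog = new ArrayList<>();

    ActionGate(Map<String, RiskTier> tierByTool, Predicate<String> humanApproves) {
        this.tierByTool = Map.copyOf(tierByTool);
        this.humanApproves = humanApproves;
    }

    // Unregistered tools are treated as IRREVERSIBLE so a missing entry fails safe.
    boolean mayExecute(String tool) {
        if (tool == null) {
            return false;
        }
        RiskTier tier = tierByTool.getOrDefault(tool, RiskTier.IRREVERSIBLE);
        return switch (tier) {
            case READ -> true;
            case REVERSIBLE_WRITE -> {
                auditLog.add("auto-executed write: " + tool);
                yield true;
            }
            case IRREVERSIBLE -> {
                boolean approved = humanApproves.test(tool);
                auditLog.add((approved ? "approved: " : "blocked: ") + tool);
                yield approved;
            }
        };
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

Defaulting unknown tools to the highest tier is a deliberate choice: forgetting to register a new tool then results in an approval prompt, not an unreviewed action.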

7.7 Sandbox Code Execution

If your agent generates and executes code, that code must run in an isolated sandbox — never on the host system. Use:

  • E2B — Sandboxed cloud environments purpose-built for agent code execution.
  • Docker containers — Process-level isolation with restricted network and filesystem.
  • Firecracker — Lightweight microVMs that start in well under a second.

Simon Willison has extensively documented how prompt injection in tool-using LLMs leads to arbitrary code execution when sandboxing is absent. Treat all agent-generated code as untrusted input.

7.8 Implement Graceful Failure and Escalation

Agents will fail. Design for it:

  • Tool failure — After 2–3 retries, stop and try an alternative or escalate.
  • Uncertainty — Instruct the agent: “If you are not confident, say so and suggest the user contact support.”
  • Escalation — Every production agent needs a handoff path to a human. This is a feature, not a failure.
  • Partial results — When hitting a budget limit, return whatever useful work was completed.
System prompt excerpt:
"If you cannot answer the user's question after using your available tools,
do NOT make up an answer. Instead respond: 'I wasn't able to find that
information. Let me connect you with a team member who can help.'
Then call the escalate_to_human tool."
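
The retry-then-escalate rule can be enforced in code as well as in the prompt. This sketch wraps a single tool call; `RetryThenEscalate` and `Outcome` are hypothetical names, the escalation text reuses the wording from the prompt excerpt above, and `maxAttempts` is assumed to be at least 1.

```java
import java.util.function.Supplier;

final class RetryThenEscalate {

    static final String ESCALATION_MESSAGE =
        "I wasn't able to find that information. "
        + "Let me connect you with a team member who can help.";

    record Outcome(boolean escalated, String message, int attempts) {}

    // Tries the tool up to maxAttempts times, then gives up with an escalation message.
    static Outcome call(Supplier<String> tool, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                return new Outcome(false, tool.get(), attempts);
            } catch (RuntimeException e) {
                // Retry. A real agent would log e and feed the error back to the model.
            }
        }
        return new Outcome(true, ESCALATION_MESSAGE, attempts);
    }
}
```

Keeping the limit in code means a model that ignores the instruction still cannot retry indefinitely.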

8. Common Mistakes and How to Avoid Them

Five common failure modes of a single ReAct agent — infinite loops, wrong tool selection, context overflow, hallucinated actions, and error cascading — each paired with its mitigation strategy.
Figure: The five most common failure modes in production ReAct agents, with their mitigations.

8.1 No Limits on the Loop

What happens: The agent encounters an ambiguous task or a flaky tool and retries the same failing call indefinitely. A single runaway task can consume $5–$50 in API costs and block resources for minutes.

How to avoid it: Set max iterations, token budget, and wall-clock timeout from day one. A 25-step limit and 60-second timeout are reasonable defaults. Monitor and adjust based on production telemetry.

8.2 Swiss-Army-Knife Tools

What happens: A single execute_sql(query) or run_code(code) tool gives the agent maximum flexibility — and maximum opportunity to cause damage. It can write arbitrary queries, access unauthorized tables, or produce malformed outputs.

How to avoid it: Decompose broad capabilities into narrow, purpose-built tools. If the agent needs order data, give it get_order(id), search_orders(customer_id), and cancel_order(id) — not raw database access.

8.3 Appending Everything to Context

What happens: The default in most frameworks is to append every tool output verbatim. Tools returning large JSON payloads, database results, or web pages fill the context in a few steps, causing overflow and attention degradation.

How to avoid it: Implement a context management strategy. Summarize large outputs, drop irrelevant history, and use retrieval for information the agent might need later. The “Lost in the Middle” paper demonstrated that LLMs lose accuracy when key information is buried in the middle of a long context.

8.4 Parsing Tool Calls with Regex

What happens: Teams extract tool calls from freeform LLM text using string matching. Model output format varies between calls, models, and versions. A single misplaced comma or markdown formatting artifact breaks the parser.

How to avoid it: Use constrained decoding (OpenAI Structured Outputs) or a structured output library (Instructor, Outlines). Never rely on the LLM to “usually” produce valid JSON — enforce it mechanically.
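
Where constrained decoding is not available, a mechanical check on the decoded call is a reasonable fallback before dispatching it. The sketch assumes the JSON has already been parsed into a tool name and an argument map; `ToolCallValidator` and `ToolSpec` are hypothetical names, and only required-argument presence is checked, not types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToolCallValidator {

    record ToolSpec(String name, Set<String> requiredArgs) {}

    // Returns a list of problems; an empty list means the call can be dispatched.
    static List<String> validate(String toolName,
                                 Map<String, Object> args,
                                 Map<String, ToolSpec> registry) {
        ToolSpec spec = (toolName == null) ? null : registry.get(toolName);
        if (spec == null) {
            return List.of("Unknown tool '" + toolName + "'. Valid tools: "
                    + new TreeSet<>(registry.keySet()));
        }
        Map<String, Object> safeArgs = (args == null) ? Map.of() : args;
        List<String> problems = new ArrayList<>();
        for (String required : new TreeSet<>(spec.requiredArgs())) {   // sorted for stable output
            Object value = safeArgs.get(required);
            if (value == null || value.toString().isBlank()) {
                problems.add("Missing required argument '" + required
                        + "' for tool '" + toolName + "'.");
            }
        }
        return problems;
    }
}
```

The problem strings are written so they can be returned to the model as an observation, giving it a concrete correction to make on the next step.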

8.5 No Human Oversight for Writes

What happens: The agent autonomously sends emails, deletes records, or deploys code without any approval gate. A hallucinated action or prompt injection causes real-world damage.

How to avoid it: Default to requiring approval for any action that modifies external state. Relax constraints only after thorough testing and with robust logging. Anthropic’s agent design guide explicitly recommends: “For high-stakes tasks, build in confirmation steps.”

8.6 Logging Only the Final Answer

What happens: When the agent produces a wrong answer, there is no trail to diagnose the root cause. Was it a bad tool call? A hallucinated plan? A retrieval failure? Without trajectory logs, you are debugging blind.

How to avoid it: Instrument every step from day one. Treat agent traces like distributed system traces — each step is a “span.” Platforms like LangSmith render these as visual timelines, making debugging tractable.

8.7 “It Worked on My Demo”

What happens: A live demo is not an evaluation. Without a systematic dataset covering edge cases, error conditions, and adversarial inputs, you have no idea what the real failure rate is.

How to avoid it: Build a benchmark of 50–200 representative tasks from your domain. Run the agent against it regularly — ideally in CI/CD. Track pass rate, average steps, average cost, and safety violations over time. The AgentBench framework provides a template for comprehensive agent evaluation.
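
A benchmark runner does not need to be elaborate to be useful. The sketch below scores an agent on pass rate and average steps; `AgentEval` and its records are hypothetical, the agent is abstracted as a function from input to result, and substring matching is a deliberately crude stand-in for a real grader.

```java
import java.util.List;
import java.util.function.Function;

final class AgentEval {

    record Task(String input, String expectedSubstring) {}
    record RunResult(String output, int steps) {}
    record Report(int total, int passed, double passRate, double averageSteps) {}

    // Runs every task through the agent and aggregates pass rate and average step count.
    static Report evaluate(List<Task> tasks, Function<String, RunResult> agent) {
        int passed = 0;
        long steps = 0;
        for (Task task : tasks) {
            RunResult result = agent.apply(task.input());
            steps += result.steps();
            if (result.output() != null && result.output().contains(task.expectedSubstring())) {
                passed++;
            }
        }
        int total = tasks.size();
        double passRate = (total == 0) ? 0.0 : (double) passed / total;
        double averageSteps = (total == 0) ? 0.0 : (double) steps / total;
        return new Report(total, passed, passRate, averageSteps);
    }
}
```

Running this in CI and failing the build when `passRate` drops below a chosen threshold turns the benchmark into a regression test for prompt and tool changes.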


9. Real-World Examples

9.1 GitHub Copilot Agent Mode

GitHub Copilot’s agent mode is a textbook single ReAct agent. Given a task like “Add input validation to the registration form,” it:

  1. Reads relevant source files (tool: file read).
  2. Reasons about the changes needed (thought).
  3. Edits multiple files (tool: file write).
  4. Runs tests and linters (tool: terminal command).
  5. Iterates on failures until the task passes.

It embodies multiple best practices simultaneously: narrow tools (file read, file write, terminal, search — not a single “do anything” tool), human-in-the-loop (requires approval for terminal commands), full observability (each step visible in the IDE), and graceful failure (reports errors and suggests next steps rather than failing silently).

9.2 ChatGPT with Tools

When a ChatGPT Plus user asks “What’s the weather in Tokyo and how does the stock market look?”, the underlying system:

  1. Calls a web search tool for Tokyo weather.
  2. Reads the observation.
  3. Calls a web search tool for stock market data.
  4. Reads the observation.
  5. Synthesizes both results into a coherent response.

This is a single ReAct agent — one model, multiple tool calls, iterative reasoning. OpenAI’s Agents SDK exposes this same pattern as a developer primitive.

9.3 Klarna AI Assistant

Klarna’s AI assistant is a production-grade single ReAct agent. In its first month:

  • Handled 2.3 million conversations (two-thirds of all customer service chats).
  • Achieved customer satisfaction scores on par with human agents.
  • Reduced average resolution time from 11 minutes to under 2 minutes.

It uses scoped tools (order lookup, refund processing, FAQ retrieval — not raw database access), structured response templates, guardrails preventing unauthorized actions, and a seamless escalation path to human agents for cases it can’t handle. The remaining third of conversations is escalated — and this graceful handoff is a feature, not a failure.

9.4 Perplexity AI

Perplexity AI operates as a search-and-synthesis ReAct agent:

  1. Decomposes the user’s question into search queries.
  2. Executes multiple web searches.
  3. Reads and cross-references the results.
  4. Produces a synthesized answer with inline citations.

It demonstrates that a single ReAct agent can produce high-quality research output when equipped with the right tools and constrained to cite its sources.

9.5 Code Interpreter (ChatGPT Advanced Data Analysis)

OpenAI’s Code Interpreter is a ReAct agent with a sandboxed Python runtime:

  1. User uploads a CSV and asks “What are the top revenue trends?”
  2. The agent writes Python code to load and analyze the data.
  3. Executes the code in a sandbox.
  4. Reads the output (or error).
  5. Iterates — fixes bugs, refines the analysis, generates charts.
  6. Presents the final result.

This showcases the sandboxing best practice: all code runs in an isolated environment, not on the host system. It also demonstrates self-correction — when code fails, the agent reads the error traceback and generates a fix.


10. Example: Building a Single ReAct Agent

With Spring AI

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.ai.tool.annotation.ToolParam;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

// Define narrow, well-described tools

@Component
public class SearchTool {

    @Tool(description = """
            Search the web for current information. Use this when you need \
            up-to-date facts, news, or data not in your training set.""")
    public String searchWeb(
            @ToolParam(description = "The search query") String query) {
        // Implementation: call a search API (Tavily, DuckDuckGo, etc.)
        return WebSearchClient.search(query);
    }
}

@Component
public class CalculatorTool {

    @Tool(description = """
            Evaluate a mathematical expression and return the result. \
            Use this for any arithmetic, percentages, or numerical comparisons. \
            Example: calculate("(37.4 - 34.4) / 34.4 * 100")""")
    public String calculate(
            @ToolParam(description = "The mathematical expression") String expression) {
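        try {
            // Caution: the JDK's bundled JavaScript engine (Nashorn) was removed in
            // Java 15, so this lookup returns null unless a separate engine (for
            // example GraalJS) is on the classpath. A script engine also executes
            // arbitrary code, so validate or sandbox the input in production.
            var engine = new javax.script.ScriptEngineManager()
                    .getEngineByName("JavaScript");
            if (engine == null) {
                return "Error: no JavaScript engine is available to evaluate expressions.";
            }
            return String.valueOf(engine.eval(expression));
        } catch (Exception e) {
            return "Error: %s. Check the expression syntax and try again."
                    .formatted(e.getMessage());
        }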
    }
}

// Create and run the agent

@Service
public class ResearchAgent {

    private final ChatClient chatClient;

    public ResearchAgent(ChatClient.Builder builder,
                         SearchTool searchTool,
                         CalculatorTool calculatorTool) {
        this.chatClient = builder
            .defaultSystem("""
                    You are a helpful research assistant. Use tools to find \
                    accurate, up-to-date information. If you cannot find an \
                    answer, say so — never fabricate data. Cite your sources.""")
            .defaultTools(searchTool, calculatorTool)
            .build();
    }

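    public String ask(String question) {
        // ChatClient runs the tool-calling loop internally. To our knowledge its
        // fluent API has no toolCallLimit(...) method, so an iteration cap (for
        // example 15 tool calls) has to be enforced separately, such as by
        // counting calls inside the tools or by taking over tool execution.
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }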
}

// Wire it up and run

@SpringBootApplication
public class AgentApplication implements CommandLineRunner {

    private final ResearchAgent agent;

    public AgentApplication(ResearchAgent agent) {
        this.agent = agent;
    }

    public static void main(String[] args) {
        SpringApplication.run(AgentApplication.class, args);
    }

    @Override
    public void run(String... args) {
        String answer = agent.ask(
            "What is the population of Tokyo and Delhi, "
                + "and what is the percentage difference?"
        );
        System.out.println(answer);
    }
}
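
The `CalculatorTool` above relies on a JavaScript script engine, which current JDKs no longer bundle and which would execute whatever the model sends it. One alternative is a small recursive-descent evaluator that accepts only numbers, `+ - * /`, and parentheses. The following is a sketch under that assumption; `SafeCalculator` and its length limit are illustrative names and values, not part of Spring AI.

```java
/** Sketch of an arithmetic-only evaluator: numbers, + - * /, parentheses. */
final class SafeCalculator {

    private static final int MAX_LENGTH = 200; // arbitrary cap, bounds recursion depth

    private final String src;
    private int pos;

    private SafeCalculator(String src) {
        this.src = src;
    }

    /** Evaluates the expression or throws IllegalArgumentException if it is invalid. */
    static double evaluate(String expression) {
        if (expression == null || expression.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("Expression is missing or too long");
        }
        SafeCalculator parser = new SafeCalculator(expression);
        double value = parser.parseExpression();
        parser.skipSpaces();
        if (parser.pos < parser.src.length()) {
            throw new IllegalArgumentException(
                "Unexpected '" + parser.src.charAt(parser.pos) + "' at position " + parser.pos);
        }
        return value;
    }

    // expression := term (('+' | '-') term)*
    private double parseExpression() {
        double value = parseTerm();
        while (true) {
            if (accept('+')) value += parseTerm();
            else if (accept('-')) value -= parseTerm();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double parseTerm() {
        double value = parseFactor();
        while (true) {
            if (accept('*')) value *= parseFactor();
            else if (accept('/')) value /= parseFactor();
            else return value;
        }
    }

    // factor := ('+' | '-') factor | '(' expression ')' | number
    private double parseFactor() {
        if (accept('-')) return -parseFactor();
        if (accept('+')) return parseFactor();
        if (accept('(')) {
            double value = parseExpression();
            if (!accept(')')) {
                throw new IllegalArgumentException("Missing ')' at position " + pos);
            }
            return value;
        }
        int start = pos; // accept() has already skipped leading spaces
        while (pos < src.length() && isNumberChar(src.charAt(pos))) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("Expected a number at position " + pos);
        }
        // NumberFormatException (e.g. for "1.2.3") is an IllegalArgumentException.
        return Double.parseDouble(src.substring(start, pos));
    }

    private static boolean isNumberChar(char c) {
        return (c >= '0' && c <= '9') || c == '.';
    }

    private boolean accept(char expected) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == expected) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(37.4 - 34.4) / 34.4 * 100"));
    }
}
```

Inside `calculate`, the script-engine call could then be replaced with `String.valueOf(SafeCalculator.evaluate(expression))`, leaving the existing `catch` block to turn an `IllegalArgumentException` into a message the model can act on. Note that division by zero follows `double` semantics here and yields `Infinity` or `NaN` rather than an error.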

References & Further Reading

Foundational Papers

  1. ReAct — Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023. The paper that defined the Thought → Action → Observation loop. The foundational pattern behind most modern single-agent systems.
  2. Toolformer — Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools”, NeurIPS 2023. Shows that a language model can learn, in a self-supervised way, which APIs to call, when to call them, and what arguments to pass. A useful reminder that tool interfaces and descriptions are a high-leverage part of agent design.
  3. Reflexion — Shinn, N. et al. “Reflexion: Language Agents with Verbal Reinforcement Learning”, NeurIPS 2023. Agents that reflect on failures and improve — self-correction best practice for single-agent systems.
  4. Chain-of-Thought Prompting — Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, NeurIPS 2022. The reasoning technique that powers the “Thought” step in ReAct.
  5. Tree of Thoughts — Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, NeurIPS 2023. Exploring multiple reasoning branches — an advanced planning strategy for single agents.
  6. MemGPT / Letta — Packer, C. et al. “MemGPT: Towards LLMs as Operating Systems”, 2023. Virtual memory management for agents — essential for long-running single-agent tasks. Evolved into Letta in 2024.
  7. Lost in the Middle — Liu, N.F. et al. “Lost in the Middle: How Language Models Use Long Contexts”, TACL 2024. Reveals attention degradation in long contexts — motivates active context management in ReAct agents.

Evaluation and Benchmarks

  1. AgentBench — Liu, X. et al. “AgentBench: Evaluating LLMs as Agents”, ICLR 2024. Comprehensive benchmark demonstrating the capability gap between GPT-4-class and smaller models on agent tasks.
  2. SWE-bench — Jimenez, C.E. et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, ICLR 2024. Gold-standard benchmark for coding agents. SWE-bench Verified (2024) provides a human-validated subset.
  3. SWE-agent — Yang, J. et al. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”, 2024. Reveals the cost and step count of autonomous single-agent coding — 20–50 LLM calls per issue.

Safety and Security

  1. OWASP Top 10 for LLM Applications — OWASP Foundation, 2023–2025. Industry-standard security risks including “Excessive Agency” — essential reading for agent builders.
  2. Prompt Injection — Greshake, K. et al. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, 2023. How prompt injection hijacks agent tool calls — motivating sandboxing and approval gates.

Industry Guides

  1. Building Effective Agents — Anthropic, “Building effective agents”, 2024. Anthropic’s practical guide — recommends starting with single agents, human-in-the-loop, and incremental complexity.
  2. A Practical Guide to Building Agents — OpenAI, “A practical guide to building agents”, 2025. OpenAI’s comprehensive guide covering design patterns, orchestration, and guardrails.
  3. OpenAI Function Calling Best Practices — OpenAI, “Function calling guide”, 2024. Official guidance on tool design, descriptions, and structured outputs.
  4. Agent2Agent (A2A) Protocol — Google, “Agent2Agent Protocol”, 2025. Open standard for inter-agent communication — relevant when graduating from single to multi-agent.
  5. Model Context Protocol (MCP) — Anthropic, modelcontextprotocol.io, 2024. Open standard for connecting LLMs to external tools and data sources — the universal protocol for agent-tool integration.

Books

  1. Stuart Russell & Peter Norvig — Artificial Intelligence: A Modern Approach, 4th ed., Pearson, 2020. The definitive textbook on intelligent agents. Chapter 2 defines the agent concept, environment types, and rational behaviour that LLM agents inherit.
  2. Chip Huyen — AI Engineering, O’Reilly, 2025. The most relevant production guide for agent builders. Covers architectures, observability, evaluation, and deployment. Argues “without tracing, you are debugging LLM applications with a blindfold on.”
  3. Jay Alammar & Maarten Grootendorst — Hands-On Large Language Models, O’Reilly, 2024. Practical guide with code examples covering prompt engineering, RAG, tool use, and agent pipelines.
  4. Sebastian Raschka — Build a Large Language Model (From Scratch), Manning, 2024. Understand the LLM internals that power agent reasoning — pre-training, fine-tuning, and RLHF from first principles.
  5. Harrison Chase & Jacob Lee — LangChain Documentation & Guides, LangChain, 2023–2026. The most widely adopted framework for building ReAct agents, with extensive tutorials and production patterns.
  6. Andrew Ng — AI Agentic Design Patterns with AutoGen, DeepLearning.AI, 2024. Short course covering ReAct, tool use, reflection, and multi-agent patterns with hands-on code.

Tools & Platforms

  1. Spring AI — Spring ecosystem framework for AI/LLM applications. Provides ChatClient with built-in tool calling, structured output (.entity()), and advisor-based extensibility for building ReAct-style agents in Java.
  2. LangGraph — Graph-based agent orchestration. create_react_agent is the standard primitive for single-agent ReAct loops.
  3. OpenAI Agents SDK — OpenAI, 2025. Lightweight framework with built-in guardrails and tracing for single and multi-agent workflows.
  4. Google Agent Development Kit (ADK) — Google, 2025. Open-source framework with native A2A and MCP support for building, evaluating, and deploying agents.
  5. LangSmith — Observability, tracing, and evaluation platform for agent trajectories.
  6. Arize Phoenix — Open-source LLM observability with trace visualization and embedding analysis.
  7. E2B — Sandboxed cloud environments for safe agent code execution.
  8. Instructor — Pydantic-based structured output extraction for Python agents.
  9. Outlines — Grammar-constrained generation for open-source models.