Deep Dive: Hierarchical Agent Systems — Supervisors, Workers, Delegation, and Quality Control

The single ReAct agent excels at single-domain tasks. Multi-agent systems extend this to multi-domain work by coordinating specialized agents. But there’s a specific multi-agent topology that deserves its own deep dive: the hierarchical supervisor pattern — where a supervisor agent decomposes tasks, delegates to worker agents, reviews their outputs, and iterates until the result meets a quality bar.

This pattern mirrors how effective organizations work: a manager plans the work, assigns tasks to specialists, reviews deliverables, and sends work back for revision when it doesn’t meet the standard. It is a common architecture for complex, quality-sensitive tasks where the output must be verified before delivery: code generation with testing, content creation with editorial review, and enterprise workflows with compliance gates.

This post is a deep dive into hierarchical agent systems: what they are, how the delegation cycle works, what they’re used for, their strengths and weaknesses, the best practices that make them reliable, the common mistakes that break them, and the real-world systems that prove they work at scale.


1. What Is a Hierarchical Agent System?

A hierarchical agent system is a multi-agent architecture where a supervisor agent (the manager) coordinates one or more worker agents (the specialists) in a structured delegation cycle. The supervisor’s job is to decompose, delegate, review, and iterate — it never does the actual work itself. Workers execute tasks using their own tools and return results to the supervisor for evaluation.

The key differentiators from other multi-agent patterns:

  • Centralized control — One supervisor makes all routing, decomposition, and quality decisions. Workers don’t talk to each other directly.
  • Built-in quality loop — The supervisor reviews every worker output before accepting it, and can send work back for revision.
  • Task decomposition as a first-class concern — The supervisor explicitly breaks high-level goals into subtasks before delegating, creating a traceable plan.
  • Accountability — The supervisor is the single point of accountability for the entire workflow. It decides when the task is complete.

“The supervisor pattern mirrors how organizations work — managers decompose goals, delegate to specialists, and synthesize. It works for agent systems for the same reason it works for human teams: specialization + oversight produces better results than either alone.” — Anthropic, Building effective agents, 2024

How It Differs from Other Patterns

| Pattern                 | Coordination  | Quality Control        | Task Decomposition     | Routing           |
|-------------------------|---------------|------------------------|------------------------|-------------------|
| Single ReAct agent      | Self (loop)   | None                   | None (single task)     | N/A               |
| Sequential pipeline     | Fixed order   | No review loop         | Pre-defined phases     | Static            |
| Peer-to-peer handoff    | Decentralized | None built-in          | None (dynamic routing) | LLM-based         |
| Parallel fan-out        | Orchestrator  | Merge-time only        | Pre-defined subtasks   | Static            |
| Hierarchical supervisor | Centralized   | Review + revision loop | LLM-decomposed         | Supervisor-driven |

The hierarchical pattern adds a quality loop that sequential pipelines lack and centralized control that peer-to-peer handoffs don’t have. This makes it the right choice when output quality matters more than speed.


2. Internal Architecture

Understanding what’s inside a hierarchical agent system clarifies the supervisor’s role, how workers operate, and where the quality loop fits.

Hierarchical agent system architecture showing a supervisor agent at the top delegating to three worker agents (Researcher, Coder, Reviewer), each with scoped tools, all connected to a shared state store.
Figure: Inside a hierarchical agent system — the supervisor decomposes and delegates, workers execute with scoped tools, results flow through shared state, and the supervisor reviews before accepting.

2.1 The Supervisor Agent

The supervisor is an LLM with a system prompt focused entirely on planning, delegation, review, and synthesis. It does not have domain-specific tools — it cannot search the web, write code, or edit files. Its tools are meta-level:

  • delegate_to_worker(worker, task_spec) — Assign a subtask to a specific worker agent.
  • request_revision(worker, feedback) — Send a worker’s output back with specific improvement instructions.
  • approve_output(worker) — Accept a worker’s output and mark the subtask as complete.
  • escalate_to_human(reason) — Hand off to a human when the system can’t resolve the task.

The supervisor reads shared state to understand the current progress and decides what to do next. This is the architectural analogue of a project manager who doesn’t write code but knows how to plan, delegate, and review.
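One way to make the "meta-level tools only" rule hard to violate is to model the supervisor's action space as a closed set of types. The sketch below is an illustration, not part of any framework: the type names (SupervisorAction, Delegate, and so on) are hypothetical, the tool names are the four listed above, and it assumes Java 17 or later for sealed interfaces.

```java
// Hypothetical modelling of the four meta-level tools from this section.
// The sealed interface means no other action type can be added elsewhere.
sealed interface SupervisorAction
        permits Delegate, RequestRevision, Approve, Escalate {}

record Delegate(String worker, String taskSpec) implements SupervisorAction {}
record RequestRevision(String worker, String feedback) implements SupervisorAction {}
record Approve(String worker) implements SupervisorAction {}
record Escalate(String reason) implements SupervisorAction {}

final class SupervisorActions {
    // Renders an action as the tool call it corresponds to, e.g. for logging.
    static String toolCall(SupervisorAction action) {
        if (action instanceof Delegate d) {
            return "delegate_to_worker(" + d.worker() + ")";
        }
        if (action instanceof RequestRevision r) {
            return "request_revision(" + r.worker() + ")";
        }
        if (action instanceof Approve a) {
            return "approve_output(" + a.worker() + ")";
        }
        if (action instanceof Escalate e) {
            return "escalate_to_human(" + e.reason() + ")";
        }
        throw new IllegalArgumentException("Unknown action: " + action);
    }
}
```

Because there is no "write code" or "search" variant, a supervisor that tries to do worker-level work has no action it can emit for it.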

2.2 Worker Agents

Each worker is a self-contained ReAct agent — one LLM with a focused system prompt and scoped tools, executing a Thought → Action → Observation loop. Workers:

  • Have a narrow system prompt defining their single role (researcher, coder, tester, reviewer).
  • Access only their own tools — the researcher has search_web and arxiv_search, the coder has file_read and file_write, the tester has run_tests.
  • Receive a structured task specification from the supervisor, not vague prose.
  • Write their outputs to shared state, making results available to the supervisor and other workers.
  • Have no knowledge of other workers — they don’t know the system is hierarchical. They just receive a task, execute it, and return a result.
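Tool scoping is easiest to trust when it is enforced outside the prompt. The following is a minimal sketch of an allowlist check that would run before any tool call is executed. The class name ToolScopes is hypothetical, the tool names come from the list above, and giving the reviewer an empty tool set is an assumption made for illustration.

```java
import java.util.Map;
import java.util.Set;

final class ToolScopes {
    // Tool names per worker as listed in this section.
    // Assumption: the reviewer works from the supplied text only and has no tools.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
        "researcher", Set.of("search_web", "arxiv_search"),
        "coder", Set.of("file_read", "file_write"),
        "tester", Set.of("run_tests"),
        "reviewer", Set.of());

    // Unknown workers get an empty scope, so they can call nothing.
    static boolean isAllowed(String worker, String tool) {
        return ALLOWED.getOrDefault(worker, Set.of()).contains(tool);
    }

    // Call this before dispatching a tool call requested by a worker's LLM.
    static void requireAllowed(String worker, String tool) {
        if (!isAllowed(worker, tool)) {
            throw new SecurityException(worker + " may not call " + tool);
        }
    }
}
```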

2.3 Shared State Store

The shared state store (see Multi-Agent Systems Deep Dive, Section 4) holds the structured outputs of every worker, the supervisor’s plan, and the current workflow status. The supervisor reads it to decide routing; workers write their results to specific fields.

import java.util.List;

// TestResults and WorkflowStatus are assumed to be defined elsewhere in the application.
public record HierarchicalWorkflowState(
    String userGoal,
    TaskPlan taskPlan,              // Supervisor's decomposition
    String researchFindings,        // Researcher output
    String codeOutput,              // Coder output
    TestResults testResults,        // Results of run_tests
    String reviewFeedback,          // Reviewer output
    int revisionCount,              // Number of revision rounds so far
    WorkflowStatus status
) {}

public record TaskPlan(List<SubTask> subtasks) {}
public record SubTask(String worker, String description, String acceptanceCriteria) {}
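Since records are immutable, "writing to a field" in practice means producing an updated copy of the state. The sketch below shows that update style on a deliberately reduced, hypothetical WorkflowSnapshot record; the method names are illustrative and not taken from any library.

```java
// Reduced, hypothetical version of the workflow state, just enough to show the update style.
record WorkflowSnapshot(String userGoal, String codeOutput, String reviewFeedback, int revisionCount) {

    // Initial state: only the goal is known.
    static WorkflowSnapshot start(String userGoal) {
        return new WorkflowSnapshot(userGoal, null, null, 0);
    }

    // The coder "writes" its field by returning a new snapshot; other fields are carried over.
    WorkflowSnapshot withCodeOutput(String code) {
        return new WorkflowSnapshot(userGoal, code, reviewFeedback, revisionCount);
    }

    // Recording a revision request stores the feedback and increments the counter.
    WorkflowSnapshot withRevisionRequested(String feedback) {
        return new WorkflowSnapshot(userGoal, codeOutput, feedback, revisionCount + 1);
    }
}
```

Each snapshot can be logged or persisted as-is, which gives a step-by-step history of the workflow without extra bookkeeping.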

3. The Delegation Cycle

The hierarchical agent operates in a delegation cycle that the supervisor drives. This is the heartbeat of the entire system — and understanding it is key to building reliable hierarchical workflows.

The supervisor delegation cycle: Decompose task into subtasks, Delegate subtask to worker, Worker executes via ReAct loop, Supervisor reviews output, Decision gate for quality, either request revision or proceed to next subtask, loop until all subtasks complete.
Figure: The supervisor's delegation cycle — decompose, delegate, wait, review, and either approve or request revision. The cycle repeats for each subtask until the workflow is complete.

The Four Phases

Phase 1: Decompose — The supervisor reads the user’s goal and breaks it into an ordered list of subtasks. Each subtask specifies which worker should handle it and what the acceptance criteria are.

Supervisor Thought: "The user wants input validation on the registration form.
I need: (1) Coder to implement the validation, (2) Reviewer to check the code,
(3) potentially a second Coder pass if revisions are needed."

Plan:
  1. Coder → "Add email format validation, password strength check, and @Valid annotation"
  2. Reviewer → "Review the code for correctness, security, and completeness"
  3. (Conditional) Coder → "Apply reviewer's corrections"

Phase 2: Delegate — The supervisor sends a structured task specification to the designated worker. The spec includes the task description, any relevant context from shared state, and the acceptance criteria.

Phase 3: Execute — The worker runs its own ReAct loop — reasoning, calling tools, observing results — until it produces an output. The supervisor waits while the worker executes.

Phase 4: Review — The supervisor evaluates the worker’s output against the acceptance criteria:

  • Approve → Mark the subtask as complete, move to the next one.
  • Request revision → Send the worker specific feedback on what to fix, and re-delegate.
  • Escalate → If the worker fails after max revision attempts, escalate to a human.

This cycle repeats until all subtasks are complete. The supervisor then synthesizes the final output from all worker contributions.
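The three outcomes of the review phase reduce to a small decision function. This is a sketch under simple assumptions: the names ReviewDecision and ReviewGate are hypothetical, and "meets criteria" is treated as a boolean that some earlier check has already produced.

```java
enum ReviewDecision { APPROVE, REQUEST_REVISION, ESCALATE }

final class ReviewGate {
    // revisionsUsed counts revision rounds already spent on this subtask.
    static ReviewDecision decide(boolean meetsCriteria, int revisionsUsed, int maxRevisions) {
        if (meetsCriteria) {
            return ReviewDecision.APPROVE;
        }
        if (revisionsUsed < maxRevisions) {
            return ReviewDecision.REQUEST_REVISION;
        }
        // Revision budget exhausted and the output still fails: hand off.
        return ReviewDecision.ESCALATE;
    }
}
```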

The Quality Gate

The review phase is what distinguishes hierarchical from sequential pipelines. In a sequential pipeline, Agent A’s output flows directly to Agent B without any quality check. In a hierarchical system, the supervisor inspects every output before it proceeds — catching errors early, before they propagate downstream.


4. Execution Trace: Seeing the Hierarchy in Action

Here is a concrete execution trace of a hierarchical system implementing a coding task with review:

Execution trace of a hierarchical coding workflow: Supervisor decomposes and delegates to Coder (6 iterations, 4 tool calls), Supervisor delegates to Reviewer (3 iterations, no tools), Supervisor requests revision, Coder applies fixes (3 iterations, 2 tool calls), Supervisor approves. Total: 12 iterations, 6 tool calls, 7,240 tokens.
Figure: A complete hierarchical execution trace — the Supervisor delegates coding, reviews the output, requests revision based on feedback, and approves the final result after one revision round.

Key observations from this trace:

  • The review loop caught real issues — The reviewer identified missing email validation, an incorrect constant, and a missing annotation. Without the review phase, these bugs would have been delivered to the user.
  • The supervisor never wrote code — It decomposed the task, delegated, reviewed, requested revision, and approved. This is the correct role for a supervisor — it’s a manager, not an individual contributor.
  • One revision round was sufficient — The coder fixed all issues on the first revision pass. The cap of 2–3 revision rounds (best practice) was not hit.
  • Cost was reasonable — 7,240 tokens and $0.036 for a coding task with code review. Supervision added roughly 30% overhead compared to a single-agent approach, and in exchange the review caught three defects before delivery.
  • Latency was sequential — 18.7 seconds total because each phase waited for the previous one. For interactive use cases, this may be too slow; for batch/async workflows, it’s acceptable.
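The totals in the trace can be checked with a few lines of arithmetic. The sketch below sums the per-worker figures from the trace and derives the blended price implied by the reported cost. The record and class names are hypothetical, and the derived rate is only what the two reported numbers imply, not a published price for any model.

```java
import java.util.List;

record TraceStep(String agent, int iterations, int toolCalls) {}

final class TraceTotals {
    // Worker runs as reported in the trace figure.
    static final List<TraceStep> WORKER_STEPS = List.of(
        new TraceStep("coder", 6, 4),
        new TraceStep("reviewer", 3, 0),
        new TraceStep("coder (revision)", 3, 2));

    static int totalIterations(List<TraceStep> steps) {
        return steps.stream().mapToInt(TraceStep::iterations).sum();
    }

    static int totalToolCalls(List<TraceStep> steps) {
        return steps.stream().mapToInt(TraceStep::toolCalls).sum();
    }

    // Blended price implied by a total cost and a total token count.
    static double usdPerMillionTokens(double costUsd, int tokens) {
        return costUsd / tokens * 1_000_000;
    }
}
```

With the trace's numbers this gives 12 iterations and 6 tool calls, and 0.036 / 7,240 works out to roughly $4.97 per million tokens.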

5. What Are Hierarchical Agent Systems Used For?

Hierarchical systems are the right architecture when tasks require decomposition + specialization + quality control — the combination that a single agent or a simple pipeline can’t provide.

5.1 Software Engineering with Code Review

The canonical use case: an agent that writes code, has it reviewed, applies fixes, and ships only when the code passes quality checks.

  • Devin — Cognition Labs’ autonomous software engineering agent reportedly follows a hierarchical design: a planning layer decomposes GitHub issues into implementation steps, delegates coding and testing to specialized modules, and iterates until tests pass. Cognition reported that it resolved 13.86% of issues unassisted on SWE-bench.
  • OpenHands — Open-source platform where a delegator agent coordinates a browser agent, a coding agent, and a terminal agent in a hierarchical structure, with the delegator reviewing outputs before proceeding.
  • Amazon Q Developer Agent for code transformation — Uses a supervisor that decomposes language migration tasks (e.g., Java 8 → Java 17), delegates file-by-file transformation to worker agents, and validates each transformation with compilation and tests before proceeding.

5.2 Research with Quality Gates

Tasks that require gathering diverse information, verifying accuracy, and producing a rigorous output.

  • STORM (Stanford) — A hierarchical research system where a coordinator generates an article outline and questions, expert agents provide grounded answers, and a writer agent synthesizes everything — with the coordinator reviewing completeness and coherence at each stage.
  • GPT Researcher — Uses a planner (supervisor) that decomposes research questions into sub-queries, dispatches searcher agents in parallel, reviews the aggregated findings for completeness, and delegates a final report to a writer agent.

5.3 Content Pipelines with Editorial Review

Writer + editor + fact-checker workflows where quality must be verified before publication.

  • Research → Write → Review → Revise — The classic hierarchical content pipeline. A supervisor delegates research, reviews findings for completeness, delegates writing, reviews the draft against style and accuracy criteria, and sends back for revision if needed.
  • Jasper AI — Uses a hierarchical approach where a strategy agent defines content requirements, a drafting agent produces content, a brand-voice enforcement agent reviews for tone consistency, and a compliance agent checks for regulatory issues — with a supervisor orchestrating the flow and requesting revisions.

5.4 Enterprise Workflows with Compliance

Processes that require auditability, role separation, and approval gates.

  • Document processing — A supervisor agent receives invoices, delegates data extraction to a parsing agent, verification to a validation agent, and approval routing to a compliance agent. The supervisor reviews each step and maintains an audit trail.
  • Financial analysis — A supervisor decomposes a financial review into data collection, analysis, and report generation subtasks. Each worker operates in isolation (the data collector can access the database; the report writer cannot), and the supervisor reviews for accuracy.

5.5 Testing and QA Pipelines

Tasks where generating outputs and verifying them are fundamentally different skills.

  • Code generation + testing — A supervisor delegates implementation to a coder agent and verification to a tester agent. The tester’s failure reports feed back through the supervisor to the coder for fixes, creating an iterative quality loop that mirrors TDD practices.
  • Data pipeline validation — A supervisor delegates data transformation to a processing agent and quality checks to a validation agent, iterating until data quality metrics meet thresholds.

6. Pros and Cons

Pros

  • Built-in quality control — The review-and-revision cycle is the defining advantage. Every worker output is evaluated before it’s accepted, catching errors that a sequential pipeline would propagate downstream. The MetaGPT paper (Hong et al., 2024) showed that structured review processes in multi-agent systems improve code quality by 20–30% compared to single-pass generation.

  • Clear accountability and audit trail — The supervisor creates an explicit trace: what was planned, what was delegated, what was reviewed, what was revised. For enterprise workflows with compliance requirements, this structured audit trail is invaluable. Every decision is logged at the supervisor level.

  • Effective task decomposition — The supervisor’s primary skill is breaking complex goals into manageable subtasks. This mirrors the Plan-and-Execute pattern from LangChain, where planning is separated from execution. Research shows that LLMs perform better on smaller, well-defined subtasks than on monolithic complex tasks (Tree of Thoughts, Yao et al., 2023).

  • Isolated worker contexts — Each worker has its own context window. The researcher can consume 50,000 tokens of search results without affecting the coder’s context. This prevents the “Lost in the Middle” degradation (Liu et al., 2024) that plagues single agents on long tasks.

  • Natural escalation path — When a worker fails repeatedly, the supervisor has a natural escalation point: return a partial result, try a different approach, or hand off to a human. This is structurally easier to implement than in flat multi-agent systems where no single agent owns the workflow.

  • Independent worker development — Workers are standard ReAct agents. They can be developed, tested, and improved independently. When the coder agent underperforms, you improve its tools or prompt without touching the reviewer or supervisor.

Cons

  • Supervisor is a single point of failure — If the supervisor LLM makes a bad decomposition (misses a critical subtask, assigns to the wrong worker, or accepts a flawed output), the entire workflow fails. Unlike peer-to-peer systems where any agent can catch errors, the hierarchical pattern concentrates all judgment in one LLM.

  • Sequential bottleneck — The supervisor must wait for each worker to complete before reviewing and deciding next steps. In a 3-worker workflow, latency is at least the sum of all worker execution times plus supervisor review time. This can make hierarchical systems significantly slower than parallel fan-out approaches.

  • Higher total cost — The supervisor makes LLM calls for planning, delegation, review, and synthesis — on top of each worker’s tool-calling iterations. A hierarchical workflow with 3 workers can easily consume 2–3× the tokens of a sequential pipeline, because the supervisor’s review calls are additional overhead.

  • Revision loops can spiral — Without explicit caps, a supervisor that’s never satisfied with a worker’s output can loop indefinitely: delegate → review → revise → review → revise → … The SWE-bench analysis shows that agents can consume 50–150 LLM calls per task; a revision loop amplifies this further.

  • Decomposition quality ceiling — The entire workflow is bounded by the supervisor’s ability to decompose the task correctly. If the supervisor misunderstands the goal, creates overlapping subtasks, or misses critical steps, all downstream work is wasted. The supervisor must be the strongest LLM in the system — weaker models struggle with multi-step planning.

  • Over-engineering risk — Hierarchical systems are the most complex multi-agent pattern. For tasks that a sequential pipeline handles adequately, adding a supervisor introduces coordination overhead without proportional quality gains. As with all multi-agent patterns, the risk of premature complexity is real.
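The sequential bottleneck is easy to quantify with a toy latency model. The sketch below compares a supervised, strictly sequential workflow against a parallel fan-out. It assumes one review after every worker run and a single merge step for fan-out; the class name and both simplifications are illustrative only.

```java
import java.time.Duration;
import java.util.List;

final class LatencyModel {
    // Hierarchical: each worker run is followed by a supervisor review, one after another.
    static Duration hierarchical(List<Duration> workerRuns, Duration reviewPerRun) {
        Duration total = Duration.ZERO;
        for (Duration run : workerRuns) {
            total = total.plus(run).plus(reviewPerRun);
        }
        return total;
    }

    // Parallel fan-out: wall-clock time is the slowest worker plus one merge step.
    static Duration fanOut(List<Duration> workerRuns, Duration mergeStep) {
        Duration slowest = Duration.ZERO;
        for (Duration run : workerRuns) {
            if (run.compareTo(slowest) > 0) {
                slowest = run;
            }
        }
        return slowest.plus(mergeStep);
    }
}
```

For three workers taking 5, 3, and 4 seconds with a 1-second review or merge, the model gives 15 seconds for the hierarchy and 6 seconds for fan-out. Revision rounds would add further worker runs to the first figure.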


7. When to Use a Hierarchical Supervisor (and When Not To)

Decision flow: Does the task need multi-step decomposition? Does it need different tools per subtask? Does output quality need iterative review? Is there a single point of accountability? If all yes, use Hierarchical Supervisor.
Figure: Choose a hierarchical supervisor when the task needs decomposition, different expertise per subtask, iterative quality review, and centralized accountability. If any of these don't apply, a simpler pattern suffices.

Use a Hierarchical Supervisor When:

  • The task requires decomposition into subtasks that demand different tools or expertise — coding + testing, research + writing + review.
  • Output quality must be verified before delivery — code review, editorial review, compliance checks, or any workflow where a second opinion adds value.
  • Iterative refinement is expected — the first attempt is unlikely to be perfect, and a revision loop is needed.
  • Auditability matters — you need a structured record of what was planned, delegated, reviewed, and approved.
  • There should be a single point of accountability — one agent that owns the end-to-end outcome and can make final decisions.

Don’t Use a Hierarchical Supervisor When:

  • A single ReAct agent can handle the task. Most coding tasks, search-and-summarize tasks, and customer service interactions don’t need decomposition or review.
  • A sequential pipeline (Agent A → Agent B → Agent C) is sufficient. If the workflow has fixed phases that don’t need revision loops, the pipeline is simpler and faster.
  • The subtasks are independent and don’t need review — use parallel fan-out instead, which is faster because it doesn’t wait for supervisor review between steps.
  • Latency is critical — the sequential nature of hierarchy (delegate → wait → review → delegate next) adds significant wall-clock time compared to parallel approaches.
  • The task doesn’t need decomposition — the supervisor is wasted overhead if the work is a single monolithic step.
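The two checklists above can be folded into a small decision helper. This is one possible encoding, not a definitive rule: the type names are hypothetical, the order of the checks is a judgment call, and latency constraints are deliberately left out.

```java
enum AgentPattern { SINGLE_AGENT, SEQUENTIAL_PIPELINE, PARALLEL_FAN_OUT, HIERARCHICAL_SUPERVISOR }

record TaskProfile(boolean needsDecomposition,
                   boolean needsDifferentTools,
                   boolean needsIterativeReview,
                   boolean needsSingleOwner,
                   boolean subtasksIndependent) {}

final class PatternChooser {
    static AgentPattern choose(TaskProfile task) {
        // No decomposition needed: a supervisor would be pure overhead.
        if (!task.needsDecomposition()) {
            return AgentPattern.SINGLE_AGENT;
        }
        // All remaining criteria hold: the hierarchy earns its cost.
        if (task.needsDifferentTools() && task.needsIterativeReview() && task.needsSingleOwner()) {
            return AgentPattern.HIERARCHICAL_SUPERVISOR;
        }
        // Independent subtasks without a review loop can run side by side.
        if (task.subtasksIndependent()) {
            return AgentPattern.PARALLEL_FAN_OUT;
        }
        // Fixed phases, no revision loop.
        return AgentPattern.SEQUENTIAL_PIPELINE;
    }
}
```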

“Add a supervisor when you need quality control — when the cost of a bad output exceeds the cost of the review overhead. For customer-facing code generation, that’s almost always. For internal research summaries, it might not be.” — OpenAI, A Practical Guide to Building Agents, 2025


8. Best Practices

Building hierarchical systems that work in production requires discipline across supervisor design, worker isolation, quality gates, and failure handling.

Eight best practices for hierarchical agent systems: thin supervisor, structured task specs, bounded revision, worker isolation, explicit quality gates, per-layer budgets, hierarchical tracing, and deterministic fallback.
Figure: The eight practices that make hierarchical agent systems reliable in production.

8.1 Keep the Supervisor Thin

The supervisor should decompose, delegate, review, and synthesize — nothing else. It should never do worker-level tasks: no searching, no coding, no writing. When the supervisor starts doing the work itself, it overloads its context, confuses its role, and defeats the purpose of having workers.

Why it matters: A supervisor that writes code when the coder is “too slow” mixes planning and execution, making the system harder to debug and the supervisor’s context window a bottleneck. Anthropic’s building effective agents guide recommends: “Keep orchestrator agents focused on coordination, not execution.”

❌ Supervisor prompt: "You are a project manager who can also write code if needed..."
✅ Supervisor prompt: "You ONLY decompose tasks, delegate to workers, review outputs,
   and request revisions. You NEVER write code, search the web, or produce content
   yourself. Your tools are: delegate_to_worker, request_revision, approve_output,
   escalate_to_human."

8.2 Delegate with Structured Task Specifications

When the supervisor delegates, it should send a typed task specification — not vague prose. The spec should include the task description, input context (from shared state), acceptance criteria, and the maximum number of tool calls allowed.

Why it matters: “Fix the code” is an ambiguous delegation. “Apply the reviewer’s three corrections: (1) add email regex validation, (2) change MIN_PASSWORD_LENGTH from 6 to 8, (3) add @Valid annotation to the request body parameter. Acceptance: all three changes applied, tests pass.” is actionable and verifiable.

public record TaskSpec(
    String worker,                    // Target worker agent
    String taskDescription,           // What to do
    String inputContext,              // Relevant data from shared state
    List<String> acceptanceCriteria,  // How the supervisor will evaluate
    int maxToolCalls                  // Budget for this subtask
) {}

// Example delegation
TaskSpec spec = new TaskSpec(
    "coder",
    "Add input validation to the UserRegistrationController",
    state.researchFindings(),   // Context from previous worker
    List.of(
        "Email field validated with regex pattern",
        "Password minimum length check (8 characters)",
        "@Valid annotation on request body parameter",
        "All existing tests still pass"
    ),
    10  // Max 10 tool calls for this subtask
);
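A typed spec only helps if malformed specs are rejected early. One option, sketched here on a hypothetical ValidatedTaskSpec record with the same fields, is to validate in the record's compact constructor so that a delegation without acceptance criteria or without a tool budget can never be constructed.

```java
import java.util.List;

// Hypothetical variant of TaskSpec that validates itself on construction.
record ValidatedTaskSpec(String worker,
                         String taskDescription,
                         String inputContext,
                         List<String> acceptanceCriteria,
                         int maxToolCalls) {
    ValidatedTaskSpec {
        if (worker == null || worker.isBlank()) {
            throw new IllegalArgumentException("worker is required");
        }
        if (taskDescription == null || taskDescription.isBlank()) {
            throw new IllegalArgumentException("taskDescription is required");
        }
        if (acceptanceCriteria == null || acceptanceCriteria.isEmpty()) {
            throw new IllegalArgumentException("at least one acceptance criterion is required");
        }
        if (maxToolCalls <= 0) {
            throw new IllegalArgumentException("maxToolCalls must be positive");
        }
        // Defensive copy so the criteria cannot change after delegation.
        acceptanceCriteria = List.copyOf(acceptanceCriteria);
    }
}
```

With this in place, a vague delegation such as "Fix the code" with no criteria fails immediately instead of producing an unreviewable result.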

8.3 Bound the Revision Loop

Cap revision rounds at 2–3 iterations. If a worker can’t meet the acceptance criteria after 3 revisions, the supervisor should escalate — not keep looping. Unbounded revision loops are the most common failure mode in hierarchical systems.

Why it matters: Each revision round costs a full worker execution cycle (5–15 tool calls) plus a supervisor review. Three rounds can quadruple the cost of a subtask. More critically, repeated revision of the same piece of work often indicates a misspecified task, not a worker capability gap — further iterations won’t help.

private static final int MAX_REVISIONS = 3;

public String supervisorReviewLoop(TaskSpec spec) {
    String result = delegateToWorker(spec);
    // One review of the initial result, plus one review after each revision
    for (int revision = 0; revision <= MAX_REVISIONS; revision++) {
        ReviewResult review = supervisorReview(spec, result);
        if (review.approved()) {
            return result;
        }
        if (revision == MAX_REVISIONS) {
            break;  // the final revision was reviewed and still rejected
        }
        log.info("Revision {}/{}: {}", revision + 1, MAX_REVISIONS, review.feedback());
        result = delegateRevision(spec, result, review.feedback());
    }
    log.warn("Max revisions reached for task: {}. Escalating.", spec.taskDescription());
    return escalateToHuman(spec, result);
}

8.4 Isolate Workers from Each Other

Workers should have no knowledge of other workers. They receive a task specification from the supervisor, execute it with their own tools, and return a result. They don’t know who worked before them, who will work after, or that the system is hierarchical.

Why it matters: When workers know about each other, they may try to coordinate directly, bypassing the supervisor and breaking the centralized control model. Isolation also means workers can be developed and tested independently — the coder doesn’t need to know about the reviewer to work correctly.

8.5 Define Quality Gates in Code, Not in Prompts

The supervisor’s review should include programmatic quality checks — not just “does this look good?” Run tests, check for required patterns, validate against schemas. Use the LLM for subjective judgment only after objective checks pass.

Why it matters: An LLM review is probabilistic. It might approve code that doesn’t compile or accept a report missing a required section. Deterministic checks (compilation, test suite, schema validation) give the same verdict on every run, so the objective failures they cover are never waved through.

public ReviewResult supervisorReview(TaskSpec spec, String workerOutput) {
    // Step 1: Deterministic checks (fast, reliable)
    if (spec.worker().equals("coder")) {
        CompileResult compile = compiler.compile(workerOutput);
        if (!compile.success()) {
            return ReviewResult.rejected("Code does not compile: " + compile.errors());
        }
        TestResult tests = testRunner.runAll();
        if (!tests.allPassed()) {
            return ReviewResult.rejected("Tests failed: " + tests.failures());
        }
    }

    // Step 2: LLM review for subjective quality (only if objective checks pass)
    String llmReview = supervisorAgent.prompt()
        .user("""
            Review this worker output against the acceptance criteria:

            TASK: %s
            CRITERIA: %s
            OUTPUT: %s

            Start your response with APPROVED or REJECTED, then give specific feedback."""
            .formatted(spec.taskDescription(),
                       spec.acceptanceCriteria(),
                       workerOutput))
        .call()
        .content();

    // Check the verdict prefix only: rejection feedback may itself mention "APPROVED"
    boolean approved = llmReview != null && llmReview.strip().startsWith("APPROVED");
    return approved
        ? ReviewResult.approved()
        : ReviewResult.rejected(llmReview);
}
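The same idea applies to non-code outputs. As a sketch of a "required patterns" check, the helper below reports which required section headings are missing from a report before any LLM review runs. The class name ReportChecks and the convention that a heading occupies its own line are assumptions for illustration.

```java
import java.util.List;

final class ReportChecks {
    // Returns the required headings that do not appear as a line of their own
    // (ignoring case and surrounding whitespace).
    static List<String> missingSections(String report, List<String> requiredHeadings) {
        return requiredHeadings.stream()
            .filter(heading -> report.lines()
                .noneMatch(line -> line.strip().equalsIgnoreCase(heading)))
            .toList();
    }
}
```

A non-empty result can be turned directly into revision feedback, for example "Missing sections: Risks", without spending a supervisor LLM call.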

8.6 Set Per-Layer Budgets

Three levels of budget limits:

  1. Per-worker budget — Each worker has its own max steps, max tokens, and timeout.
  2. Per-subtask budget — The supervisor limits total cost per subtask (including revisions).
  3. Global workflow budget — The entire hierarchy has a hard token and time ceiling.

Why it matters: Without layered budgets, a single runaway worker can consume the entire workflow’s budget. A researcher that makes 50 search calls consumes all the tokens before the coder even starts.

import java.time.Duration;

record WorkerBudget(int maxSteps, int maxTokens, Duration maxDuration) {}
record SubtaskBudget(int maxRevisions, int maxTotalTokens) {}
record WorkflowBudget(int maxTotalTokens, Duration maxTotalDuration, int maxSubtasks) {}

// Layered budget configuration
WorkerBudget coderBudget    = new WorkerBudget(15, 20_000, Duration.ofSeconds(30));
WorkerBudget reviewerBudget = new WorkerBudget(5, 10_000, Duration.ofSeconds(15));
SubtaskBudget subtaskBudget = new SubtaskBudget(3, 40_000);  // 3 revisions max
WorkflowBudget globalBudget = new WorkflowBudget(100_000, Duration.ofMinutes(3), 10);
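Budget records describe the limits but do not enforce them. One way to enforce all three layers at once is to chain the budgets so that every charge is checked against the worker, the subtask, and the workflow. The TokenBudget class below is a hypothetical sketch covering tokens only; steps and timeouts would need the same treatment.

```java
final class TokenBudget {
    private final String name;
    private final int maxTokens;
    private final TokenBudget parent;  // null for the global workflow budget
    private int used;

    TokenBudget(String name, int maxTokens, TokenBudget parent) {
        this.name = name;
        this.maxTokens = maxTokens;
        this.parent = parent;
    }

    // Charges tokens to this budget and every ancestor, or to none of them.
    void charge(int tokens) {
        // First pass: verify the whole chain so a rejected charge leaves no partial update.
        for (TokenBudget b = this; b != null; b = b.parent) {
            if (b.used + tokens > b.maxTokens) {
                throw new IllegalStateException(b.name + " budget exceeded");
            }
        }
        // Second pass: record the usage at every level.
        for (TokenBudget b = this; b != null; b = b.parent) {
            b.used += tokens;
        }
    }

    int used() {
        return used;
    }
}
```

A worker budget created with the subtask budget as parent, which in turn has the workflow budget as parent, stops a runaway researcher at its own limit long before the global ceiling is reached.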

8.7 Implement Hierarchical Tracing

Use parent-child span IDs to trace the relationship between supervisor and worker agent runs. The supervisor’s trace is the parent; each worker’s trace is a child span. This lets tracing tools that understand span hierarchies, such as LangSmith, render the full workflow as a tree.

Why it matters: In a flat trace, supervisor and worker steps are interleaved and hard to distinguish. Hierarchical tracing makes it immediately clear which worker was responsible for a specific output, how many revisions it took, and where the quality gate passed or failed.

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.ArrayList;
import java.util.List;

public String runHierarchicalWorkflow(String userGoal) {
    Span supervisorSpan = tracer.spanBuilder("supervisor")
        .setAttribute("user_goal", userGoal)
        .startSpan();

    try (Scope scope = supervisorSpan.makeCurrent()) {
        TaskPlan plan = decompose(userGoal);
        List<String> results = new ArrayList<>();

        for (SubTask subtask : plan.subtasks()) {
            // Started while the supervisor span is current, so it becomes a child span
            Span workerSpan = tracer.spanBuilder("worker:" + subtask.worker())
                .setAttribute("subtask", subtask.description())
                .startSpan();
            try (Scope workerScope = workerSpan.makeCurrent()) {
                results.add(delegateAndReview(subtask));
                workerSpan.setAttribute("revisions", getRevisionCount());
                workerSpan.setAttribute("tokens", getTokenCount());
            } finally {
                workerSpan.end();
            }
        }
        // Placeholder synthesis: join the approved worker outputs
        return String.join("\n\n", results);
    } finally {
        supervisorSpan.end();
    }
}

8.8 Have a Deterministic Fallback Plan

When the supervisor LLM fails to produce a valid plan, because it hallucinates subtasks, misses critical steps, or returns an unparseable response, fall back to a hard-coded default plan. This prevents the entire workflow from failing because of a single bad planning call.

Why it matters: The supervisor’s decomposition is the foundation the entire workflow builds on. If it fails, everything downstream is wasted. A deterministic fallback ensures the workflow can proceed with a sensible default even when the LLM’s planning goes wrong.

public TaskPlan decompose(String userGoal) {
    try {
        TaskPlan llmPlan = supervisorAgent.prompt()
            .user("Decompose this task into subtasks: " + userGoal)
            .call()
            .entity(TaskPlan.class);

        if (isValidPlan(llmPlan)) {
            return llmPlan;
        }
        log.warn("LLM plan failed validation. Falling back to default.");
    } catch (Exception e) {
        log.error("LLM planning failed: {}. Falling back.", e.getMessage());
    }

    // Deterministic fallback — always works
    return new TaskPlan(List.of(
        new SubTask("researcher", "Research: " + userGoal, "Complete findings"),
        new SubTask("coder", "Implement based on research", "Code compiles, tests pass"),
        new SubTask("reviewer", "Review implementation", "No critical issues")
    ));
}

9. Common Mistakes and How to Avoid Them

Six common hierarchical agent failure modes — supervisor micro-management, infinite revision loop, bad task decomposition, supervisor bottleneck, worker scope creep, and lost review context — each paired with its mitigation strategy.
Figure: The six most common failure modes in hierarchical agent systems, with their mitigations.

9.1 Supervisor Micro-Management

What happens: The supervisor’s system prompt says “You can also write code if the coder is struggling.” The supervisor starts doing the work itself — writing code, searching the web, generating content. Its context fills up, its planning quality degrades, and the workers sit idle.

How to avoid it: The supervisor’s system prompt must explicitly prohibit doing worker-level tasks. Its only tools should be delegation, review, and escalation. Test this with adversarial prompts: give the supervisor a coding task and verify it delegates instead of coding.

9.2 Infinite Revision Loop

What happens: The supervisor keeps requesting revisions because the worker can’t meet an acceptance criterion that is ambiguous, contradictory, or beyond the worker’s capabilities. Each round burns 5–15 tool calls, and the loop runs until the global budget is exhausted.

How to avoid it: Cap revisions at 2–3 rounds (see Best Practice 8.3). After the cap, the supervisor must accept the best-effort result, escalate to a human, or try a different decomposition. Log revision reasons to identify recurring quality gaps.
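
The cap and the reason log can be sketched as a small framework-free loop. This is an illustration under assumptions, not a prescribed implementation: `BoundedRevision`, `Verdict`, and `Outcome` are hypothetical names, and the review and revise steps are passed in as plain functions so the block runs without an LLM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical helper: review, revise up to a cap, and record why each draft was rejected.
final class BoundedRevision {

    enum Status { APPROVED, CAP_REACHED }

    // Review verdict: approved, or rejected with a reason.
    record Verdict(boolean approved, String reason) {}

    // Final output plus every rejection reason, kept for later analysis.
    record Outcome(Status status, String output, List<String> revisionReasons) {}

    static Outcome run(String firstDraft,
                       Function<String, Verdict> review,
                       BiFunction<String, String, String> revise,
                       int maxRevisions) {
        String output = firstDraft;
        List<String> reasons = new ArrayList<>();
        for (int revisions = 0; ; revisions++) {
            Verdict verdict = review.apply(output);
            if (verdict.approved()) {
                return new Outcome(Status.APPROVED, output, List.copyOf(reasons));
            }
            reasons.add(verdict.reason());
            if (revisions == maxRevisions) {
                // Cap hit: the caller decides whether to accept, escalate, or re-plan.
                return new Outcome(Status.CAP_REACHED, output, List.copyOf(reasons));
            }
            output = revise.apply(output, verdict.reason());
        }
    }
}
```

Returning `CAP_REACHED` rather than silently accepting the draft keeps the accept, escalate, or re-decompose decision explicit in the caller, and the collected reasons show which criteria workers repeatedly fail.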

9.3 Bad Task Decomposition

What happens: The supervisor splits the task incorrectly: overlapping subtasks (two workers doing the same work), missing subtasks (a critical step is skipped), or subtasks in the wrong order (the coder is asked to implement before the researcher has findings).

How to avoid it: Validate the plan before execution. Check for duplicate worker assignments, verify dependencies are satisfied (a writer subtask should come after a researcher subtask), and validate that the plan covers all requirements. A structured plan schema (see TaskPlan record above) makes validation straightforward.

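boolean isValidPlan(TaskPlan plan) {
    if (plan == null || plan.subtasks() == null || plan.subtasks().isEmpty()) return false;
    List<String> workers = plan.subtasks().stream()
        .map(SubTask::worker)
        .toList();
    // No duplicate workers in consecutive subtasks (a sign of overlapping work)
    for (int i = 1; i < workers.size(); i++) {
        if (workers.get(i).equals(workers.get(i - 1))) return false;
    }
    // Dependency order: when both are present, research must precede implementation
    int researcher = workers.indexOf("researcher");
    int coder = workers.indexOf("coder");
    if (researcher >= 0 && coder >= 0 && researcher > coder) return false;
    // At least one producing phase must be present
    return researcher >= 0 || coder >= 0;
}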

9.4 Supervisor Bottleneck

What happens: All data flows through the supervisor. When a researcher produces 10,000 tokens of findings, the supervisor reads them, summarizes, and passes them to the writer. The supervisor’s context fills up with content it doesn’t need to see in full, and it becomes a latency and cost bottleneck.

How to avoid it: Use shared state. Workers write directly to the state store; downstream workers and the supervisor read from it. The supervisor only needs to check state status and summaries, not read full worker outputs. This keeps the supervisor’s context lean and focused on coordination.
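
As a rough illustration of the split between what the supervisor reads and what downstream workers read, here is a minimal in-memory sketch. `SharedStateStore` and its method names are hypothetical; a production system would more likely back this with a database or cache.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory state store shared by the supervisor and workers.
final class SharedStateStore {

    // One entry per worker: a short summary for the supervisor, full output for downstream workers.
    record Entry(String status, String summary, String fullOutput) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Called by a worker when it finishes its subtask.
    void publish(String worker, String summary, String fullOutput) {
        entries.put(worker, new Entry("DONE", summary, fullOutput));
    }

    // Supervisor view: status and summary only, never the full payload.
    Optional<String> supervisorView(String worker) {
        return Optional.ofNullable(entries.get(worker))
            .map(e -> e.status() + ": " + e.summary());
    }

    // Downstream worker view: the full output.
    Optional<String> fullOutput(String worker) {
        return Optional.ofNullable(entries.get(worker)).map(Entry::fullOutput);
    }
}
```

With this shape, a 10,000-token research result enters the writer's context through `fullOutput`, while the supervisor's context only grows by one summary line per worker.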

9.5 Worker Scope Creep

What happens: A coder agent, given a task to add validation, decides to also refactor the entire class, add logging, and update the README. It goes far beyond its delegated scope, consuming tokens and potentially breaking other parts of the codebase.

How to avoid it: Worker system prompts must include explicit scope constraints: “You are delegated a specific task. Implement ONLY what is described in the task specification. Do not make changes beyond the scope of your assignment.” Combine this with narrow tool access (the coder can only modify files in the specified directory).
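
The directory restriction can be enforced in the tool itself rather than left to the prompt. The sketch below shows one way to do that with `java.nio.file.Path`; `ScopedPathGuard` is a hypothetical helper, and it does not resolve symbolic links, which a hardened version would need to handle.

```java
import java.nio.file.Path;

// Hypothetical guard a file-writing tool could consult before touching disk.
final class ScopedPathGuard {

    private final Path allowedRoot;

    ScopedPathGuard(Path allowedRoot) {
        this.allowedRoot = allowedRoot.toAbsolutePath().normalize();
    }

    // True only if the requested path stays inside the allowed directory
    // after resolving relative segments such as "..".
    boolean isAllowed(String requestedPath) {
        Path resolved = allowedRoot.resolve(requestedPath).normalize();
        return resolved.startsWith(allowedRoot);
    }
}
```

A `writeFile` tool would call `isAllowed(path)` first and return a refusal message to the worker when the check fails, so an out-of-scope edit is blocked even if the model ignores its instructions.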

9.6 Lost Review Context

What happens: The reviewer agent receives only the worker’s output, without the original task specification or acceptance criteria. It reviews in a vacuum — checking for generic quality rather than verifying that specific requirements were met.

How to avoid it: Always pass the task specification AND the worker output to the reviewer. The reviewer should check the output against the acceptance criteria, not just assess general quality.

// Always include both the spec and the output in the review prompt
String reviewPrompt = """
    TASK SPECIFICATION:
    %s

    ACCEPTANCE CRITERIA:
    %s

    WORKER OUTPUT:
    %s

    Review whether the output meets ALL acceptance criteria. For each criterion,
    state PASS or FAIL with specific evidence."""
    .formatted(spec.taskDescription(),
               String.join("\n", spec.acceptanceCriteria()),
               workerOutput);
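
Once the reviewer answers per criterion, the response can be turned into an approve or reject decision mechanically. The parser below is a sketch that assumes the reviewer writes one verdict per line starting with `PASS` or `FAIL`; that line format and the `ReviewVerdictParser` name are assumptions, since the prompt above does not enforce a strict output format.

```java
import java.util.List;

// Hypothetical parser for reviewer output with one PASS/FAIL verdict per line.
final class ReviewVerdictParser {

    // Approval flag plus the failing lines, which can be fed back to the worker.
    record Parsed(boolean approved, List<String> failures) {}

    static Parsed parse(String reviewText) {
        List<String> lines = reviewText.lines().map(String::strip).toList();
        List<String> failures = lines.stream()
            .filter(line -> line.startsWith("FAIL"))
            .toList();
        long passes = lines.stream()
            .filter(line -> line.startsWith("PASS"))
            .count();
        // Approve only when at least one criterion was judged and none failed.
        boolean approved = failures.isEmpty() && passes > 0;
        return new Parsed(approved, failures);
    }
}
```

Treating a response with no verdict lines as a rejection is a deliberately conservative choice: free-form praise such as "looks good" does not count as evidence that the criteria were checked.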

10. Real-World Examples

10.1 Devin — Autonomous Software Engineering

Devin (Cognition Labs) is a prominent production example of hierarchical agent design for software engineering. Cognition has not published its internals in detail, but its publicly described workflow maps onto the supervisor pattern:

  1. A planning layer reads the GitHub issue, understands the codebase, and decomposes the task into implementation steps.
  2. A coding layer writes the actual code — editing files, creating new modules, updating configurations.
  3. A testing layer runs the test suite and evaluates results.
  4. The planner reviews test results and either approves (opens a PR) or requests revisions from the coding layer.

Cognition reported that Devin resolved 13.86% of issues unassisted on a random 25% subset of the SWE-bench test set. The revision loop is a key part of that result: many issues require 2–3 implementation attempts before the code passes tests.

10.2 MetaGPT — Software Company Simulation

MetaGPT (Hong et al., 2024) is a research framework that simulates a software company as a hierarchical agent system:

  • A Product Manager agent defines requirements.
  • An Architect agent designs the system structure.
  • Engineer agents implement the code.
  • A QA agent reviews and tests.

The hierarchy follows Standard Operating Procedures (SOPs) — each agent has a defined role, inputs, outputs, and quality criteria. MetaGPT demonstrated that structured hierarchical collaboration produces code with significantly fewer bugs than unstructured multi-agent chat (AutoGen-style), because the SOPs prevent agents from going off-script.

10.3 ChatDev — Multi-Role Software Development

ChatDev (Qian et al., 2024) is another hierarchical software development system with explicit role hierarchy:

  • CEO agent defines the project scope.
  • CTO agent makes technical decisions.
  • Programmer agents implement features.
  • Tester agents write and run tests.
  • Art Designer agents create visual assets.

ChatDev showed that role-based hierarchical delegation produces more complete and higher-quality software projects than flat multi-agent collaboration. The key insight: having a CTO that reviews the programmer’s architectural decisions catches structural issues that post-hoc testing alone misses.

10.4 Amazon Q Developer — Code Transformation

Amazon Q Developer uses a hierarchical approach for large-scale code transformations (e.g., Java 8 → 17 migration):

  1. A planning agent analyzes the codebase and creates a file-by-file migration plan.
  2. Transformation agents handle individual file migrations, applying language-specific rules.
  3. A validation agent compiles the transformed code and runs tests.
  4. The planner reviews compilation and test results, re-delegating failed files for correction.

This hierarchical approach handles codebases with hundreds of files — far beyond what a single agent could manage in one context window.

10.5 GitHub Copilot Workspace

GitHub Copilot Workspace applies hierarchical principles to collaborative coding:

  1. A planning phase analyzes the issue and proposes a step-by-step plan.
  2. An implementation phase generates code changes for each step.
  3. A verification phase validates the changes against the plan.
  4. The user acts as the human supervisor, reviewing the plan and implementation before committing.

This human-in-the-loop hierarchical design combines LLM decomposition with human oversight — the most reliable form of hierarchical quality control.


11. Example: Building a Hierarchical Agent System

With Spring AI

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.ai.tool.annotation.ToolParam;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.UUID;

// Step 1: Define the shared state and task specification

public record TaskSpec(
    String worker,
    String taskDescription,
    String inputContext,
    List<String> acceptanceCriteria,
    int maxToolCalls
) {}

public record ReviewResult(boolean approved, String feedback) {
    static ReviewResult approved() { return new ReviewResult(true, ""); }
    static ReviewResult rejected(String feedback) { return new ReviewResult(false, feedback); }
}

public record WorkflowState(
    String userGoal,
    List<TaskSpec> plan,
    String researchFindings,
    String codeOutput,
    String reviewFeedback,
    int revisionCount
) {}

// Step 2: Define worker tools (scoped per worker)

@Component
public class ResearchTools {

    @Tool(description = """
            Search the web for current information. Use when you need \
            up-to-date facts, documentation, or best practices.""")
    public String searchWeb(
            @ToolParam(description = "The search query") String query) {
        return WebSearchClient.search(query);
    }
}

@Component
public class CodingTools {

    @Tool(description = """
            Read the contents of a source file. Use to understand \
            existing code before making changes.""")
    public String readFile(
            @ToolParam(description = "Absolute file path") String path) {
        return FileUtils.readFile(path);
    }

    @Tool(description = """
            Write content to a source file. Use to create or modify \
            source code files.""")
    public String writeFile(
            @ToolParam(description = "Absolute file path") String path,
            @ToolParam(description = "File content to write") String content) {
        FileUtils.writeFile(path, content);
        return "File written: " + path;
    }

    @Tool(description = """
            Run the project test suite. Returns test results including \
            pass/fail status and failure messages.""")
    public String runTests() {
        return TestRunner.runAll().toString();
    }
}

// Step 3: Configure the supervisor and worker agents

@Configuration
public class HierarchicalAgentConfig {

    @Bean
    ChatClient supervisorAgent(ChatClient.Builder builder) {
        return builder.clone()
            .defaultSystem("""
                You are a Project Supervisor. Your ONLY job is to:
                1. Decompose the user's goal into subtasks
                2. Delegate subtasks to workers (researcher, coder, reviewer)
                3. Review worker outputs against acceptance criteria
                4. Request revisions when outputs don't meet the bar
                5. Approve and synthesize final results

                You NEVER do the work yourself. You NEVER search the web, \
                write code, or produce content. You ONLY coordinate.""")
            .build();
    }

    @Bean
    ChatClient researcherWorker(ChatClient.Builder builder, ResearchTools tools) {
        return builder.clone()
            .defaultSystem("""
                You are a Research Specialist. Execute the assigned research \
                task thoroughly. Use your search tools to find accurate, \
                current information. Cite your sources. Stay within the \
                scope of the task specification — do not research beyond \
                what is requested.""")
            .defaultTools(tools)
            .build();
    }

    @Bean
    ChatClient coderWorker(ChatClient.Builder builder, CodingTools tools) {
        return builder.clone()
            .defaultSystem("""
                You are a Software Engineer. Implement the assigned coding \
                task precisely as specified. Read existing code first, then \
                make targeted changes. Run tests after changes. Stay within \
                the scope — implement ONLY what is described in the task \
                specification. Do not refactor, redesign, or add features \
                beyond the assignment.""")
            .defaultTools(tools)
            .build();
    }

    @Bean
    ChatClient reviewerWorker(ChatClient.Builder builder) {
        return builder.clone()
            .defaultSystem("""
                You are a Code Reviewer. Review the provided code against \
                the acceptance criteria. For each criterion, state PASS or \
                FAIL with specific evidence. Flag security issues, bugs, \
                and style violations. Be specific and actionable in your \
                feedback.""")
            .build();  // No tools — reviews from context only
    }
}

// Step 4: Implement the hierarchical orchestration with review loop

@Service
public class HierarchicalWorkflow {

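    // SLF4J logger used by the log.info/log.warn calls below
    private static final org.slf4j.Logger log =
        org.slf4j.LoggerFactory.getLogger(HierarchicalWorkflow.class);

    private static final int MAX_REVISIONS = 3;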

    private final ChatClient supervisorAgent;
    private final ChatClient researcherWorker;
    private final ChatClient coderWorker;
    private final ChatClient reviewerWorker;

    public HierarchicalWorkflow(
            @Qualifier("supervisorAgent") ChatClient supervisorAgent,
            @Qualifier("researcherWorker") ChatClient researcherWorker,
            @Qualifier("coderWorker") ChatClient coderWorker,
            @Qualifier("reviewerWorker") ChatClient reviewerWorker) {
        this.supervisorAgent = supervisorAgent;
        this.researcherWorker = researcherWorker;
        this.coderWorker = coderWorker;
        this.reviewerWorker = reviewerWorker;
    }

    public String execute(String userGoal) {
        String correlationId = UUID.randomUUID().toString();
        log.info("[{}] Starting hierarchical workflow: {}", correlationId, userGoal);

        // Phase 1: Supervisor decomposes the task
        List<TaskSpec> plan = supervisorDecompose(userGoal);
        log.info("[{}] Plan created: {} subtasks", correlationId, plan.size());

        // Phase 2: Execute each subtask with review loop
        WorkflowState state = new WorkflowState(userGoal, plan, null, null, null, 0);

        for (TaskSpec task : plan) {
            state = executeWithReview(task, state, correlationId);
        }

        // Phase 3: Supervisor synthesizes final output
        String finalOutput = supervisorSynthesize(state);
        log.info("[{}] Workflow complete", correlationId);
        return finalOutput;
    }

    private WorkflowState executeWithReview(TaskSpec task, WorkflowState state,
                                            String correlationId) {
        String output = delegateToWorker(task, state);
        int revisions = 0;

        while (true) {
            // Review every output, including the last permitted revision
            ReviewResult review = supervisorReview(task, output);

            if (review.approved()) {
                log.info("[{}] {} output approved (revisions: {})",
                    correlationId, task.worker(), revisions);
                return updateState(state, task.worker(), output);
            }
            if (revisions == MAX_REVISIONS) {
                break;  // Cap reached with the latest output still rejected
            }

            revisions++;
            log.info("[{}] {} revision {}/{}: {}",
                correlationId, task.worker(), revisions, MAX_REVISIONS,
                review.feedback());
            output = delegateRevision(task, output, review.feedback(), state);
        }

        log.warn("[{}] Max revisions reached for {}. Accepting best effort.",
            correlationId, task.worker());
        return updateState(state, task.worker(), output);
    }

    private String delegateToWorker(TaskSpec task, WorkflowState state) {
        ChatClient worker = switch (task.worker()) {
            case "researcher" -> researcherWorker;
            case "coder" -> coderWorker;
            case "reviewer" -> reviewerWorker;
            default -> throw new IllegalArgumentException(
                "Unknown worker: " + task.worker());
        };

        return worker.prompt()
            .user("TASK: %s\nCONTEXT: %s\nCRITERIA: %s"
                .formatted(task.taskDescription(),
                           task.inputContext() != null ? task.inputContext() : "None",
                           String.join(", ", task.acceptanceCriteria())))
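            // Assumption: ChatClient exposes no per-request tool-call cap, so
            // task.maxToolCalls() has to be enforced inside the tools themselves
            // (for example, with a counter that each tool method checks).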
            .call()
            .content();
    }

    // ... supervisorDecompose, supervisorReview, supervisorSynthesize,
    //     delegateRevision, updateState methods follow the same pattern
}
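
Of the elided methods, `updateState` is the only one that involves no LLM call, so it is easy to sketch in isolation. The version below is one plausible shape, not the author's implementation: `StateUpdates` is a hypothetical holder class, and `State` is a reduced copy of `WorkflowState` (only the three output fields) so that the block compiles alone.

```java
// Hypothetical sketch of updateState; State is a reduced copy of WorkflowState.
final class StateUpdates {

    record State(String researchFindings, String codeOutput, String reviewFeedback) {}

    // Returns a new State with the field owned by the given worker replaced.
    // Records are immutable, so every update produces a fresh instance.
    static State updateState(State state, String worker, String output) {
        return switch (worker) {
            case "researcher" -> new State(output, state.codeOutput(), state.reviewFeedback());
            case "coder" -> new State(state.researchFindings(), output, state.reviewFeedback());
            case "reviewer" -> new State(state.researchFindings(), state.codeOutput(), output);
            default -> throw new IllegalArgumentException("Unknown worker: " + worker);
        };
    }
}
```

Because each call returns a new record, the loop in `execute` can keep reassigning `state` without any worker being able to mutate results written by another.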

Key Design Decisions

  1. Thin supervisor — The supervisor has no domain tools. It can only coordinate, review, and synthesize.
  2. Scoped worker tools — The researcher has search tools, the coder has file and test tools, the reviewer has no tools (context-only review).
  3. Bounded revision — Max 3 revisions per subtask. After that, accept best effort or escalate.
  4. Structured task specs — Every delegation uses a typed TaskSpec with acceptance criteria.
  5. Deterministic routing — The orchestration is in code (sequential plan execution), not LLM-driven.
  6. Hierarchical tracing — Correlation IDs and per-worker logging enable end-to-end debugging.

References & Further Reading

Foundational Papers

  1. ReAct — Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023. The reasoning-action loop that powers each worker agent in a hierarchical system.
  2. MetaGPT — Hong, S. et al. “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, ICLR 2024. Assigns SOPs to agents in a hierarchical structure. Demonstrates that structured role-based collaboration produces fewer bugs than unstructured multi-agent chat.
  3. ChatDev — Qian, C. et al. “Communicative Agents for Software Development”, ACL 2024. Multi-role software development with CEO, CTO, Programmer, and Tester hierarchy — demonstrates that explicit role hierarchy improves code quality.
  4. STORM — Shao, Y. et al. “Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models”, NAACL 2024. Hierarchical research system with coordinator, expert, and writer agents producing Wikipedia-quality articles.
  5. AutoGen — Wu, Q. et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, 2023. Multi-agent framework supporting hierarchical patterns. The v0.4 rewrite (AgentChat) introduced typed messages for production hierarchical deployments.
  6. Reflexion — Shinn, N. et al. “Reflexion: Language Agents with Verbal Reinforcement Learning”, NeurIPS 2023. Self-correction through reflection — the theoretical foundation for the supervisor review loop.
  7. Tree of Thoughts — Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, NeurIPS 2023. Exploring multiple reasoning branches — supports the supervisor’s planning and decomposition capability.

Evaluation and Benchmarks

  1. AgentBench — Liu, X. et al. “AgentBench: Evaluating LLMs as Agents”, ICLR 2024. Comprehensive agent benchmark — essential for evaluating both supervisor and worker agents before composing them.
  2. SWE-bench — Jimenez, C.E. et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, ICLR 2024. Gold-standard coding benchmark. Hierarchical systems (Devin, OpenHands) increasingly outperform flat agents. SWE-bench Verified provides a human-validated subset.
  3. SWE-agent — Yang, J. et al. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”, NeurIPS 2024. Cost and iteration analysis: hierarchical revision loops can push per-task costs to 50–150 LLM calls.

Safety and Security

  1. OWASP Top 10 for LLM Applications — OWASP Foundation, 2023–2025. The “Excessive Agency” risk is multiplied in hierarchical systems, where both the supervisor and the workers are attack surfaces.
  2. Prompt Injection — Greshake, K. et al. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, 2023. Injection in a worker’s tool output can influence the supervisor’s review — motivating worker isolation and output validation.

Industry Guides

  1. Building Effective Agents — Anthropic, “Building effective agents”, 2024. Defines the “orchestrator-workers” pattern that underpins hierarchical systems. Recommends keeping orchestrators focused on coordination, not execution.
  2. A Practical Guide to Building Agents — OpenAI, “A practical guide to building agents”, 2025. Covers task decomposition, deterministic vs. LLM-based routing, and quality gates — all relevant to hierarchical supervisor design.
  3. Agent2Agent (A2A) Protocol — Google, “Agent2Agent Protocol”, 2025. Open standard for inter-agent communication — enables hierarchical agents to delegate to workers across organizational boundaries.
  4. Model Context Protocol (MCP) — Anthropic, modelcontextprotocol.io, 2024. Shared tool servers prevent tool duplication across workers in a hierarchical system.

Books, Courses, and Documentation

  1. Stuart Russell & Peter Norvig — Artificial Intelligence: A Modern Approach, 4th ed., Pearson, 2020. Part IV covers multi-agent coordination, hierarchical planning, and task decomposition — the theoretical foundation for hierarchical agent architectures.
  2. Chip Huyen — AI Engineering, O’Reilly, 2025. Covers agent architectures, observability, and evaluation. Discusses hierarchical patterns in the context of production reliability.
  3. Jay Alammar & Maarten Grootendorst — Hands-On Large Language Models, O’Reilly, 2024. Practical guide covering the building blocks that power worker agents in hierarchical systems.
  4. Harrison Chase & Jacob Lee — LangChain Documentation & Guides, LangChain, 2023–2026. LangGraph’s hierarchical agent tutorials demonstrate supervisor-worker patterns with typed state and conditional routing.
  5. Andrew Ng — AI Agentic Design Patterns with AutoGen, DeepLearning.AI, 2024. Covers the reflection and planning patterns that power the supervisor review loop.

Tools & Platforms

  1. Spring AI — Multiple ChatClient instances with qualifier-based DI compose naturally into hierarchical workflows. The supervisor is a ChatClient with coordination-only tools; workers are ChatClients with domain-specific tools.
  2. LangGraph — Graph-based orchestration with conditional routing. The supervisor is a node that routes to worker nodes based on state, with built-in support for revision loops via cyclic edges.
  3. OpenAI Agents SDK — Supports hierarchical patterns through nested agent invocation and handoffs. The supervisor agent can invoke worker agents as tools.
  4. Google Agent Development Kit (ADK) — Open-source framework with native support for hierarchical agent compositions and A2A delegation.
  5. CrewAI — Supports hierarchical process mode where a manager agent delegates to crew members and reviews outputs.
  6. AutoGen — GroupChat with a manager agent implements the hierarchical pattern. The v0.4 AgentChat rewrite provides typed messages for supervisor-worker communication.
  7. LangSmith — Hierarchical trace visualization renders supervisor-worker relationships as parent-child spans.
  8. Arize Phoenix — Open-source LLM observability supporting nested trace visualization for hierarchical workflows.