From the Build · 13 min · June 1, 2026

How to Know If Your Multi-Agent System Is Built Correctly

The system runs. It does what you built it to do, most of the time, with inputs it has seen before. But ask a new team member to explain what any single agent in it does — not the whole pipeline, just one — and watch what happens. They pull up three files, trace a call into a fourth, flip to the system prompt, come back. Twenty minutes to produce a one-sentence answer.

That is the tell. A multi-agent system built correctly looks different from the outside: every agent can be described independently, without reading its internals, because the design forced that description to exist before the code did. The spec is the architecture. If you cannot write it, you have not finished thinking.

The test

Before asking whether your system is correct, ask a harder question: can every agent in it be specified? Not documented after the fact — specified first, as a contract that would exist even if the implementation were never written.

An agent is independent only when someone who has never read its source can answer five things cold:

  1. When should this agent be used? — the trigger, not the biography.
  2. What does it need to start? — its declared input, typed and bounded.
  3. What does it hand back? — its output, in a shape the next consumer already expects.
  4. How do we know it is truly done? — an objective gate, not "looks good."
  5. What test cases prove it works? — a fixed input → expected-output suite.

If any of those questions leaves you with a blank, the agent is not independent. It is a piece of something larger that only the original author can hold in their head. That is not an agent system; it is a distributed monolith with good intentions.

A blank answer isn't a "do it later." It's a risk you can already see.

The value of specifying agents before building them is exactly this: the gaps become legible. Most multi-agent failures happen in gaps — undefined handoffs, unspecified fallback behavior, agents scoped to the happy path and unprepared for anything else.


Step 1 — Write the spec before the agent

For every agent in your system, fill in the following form before writing a line of implementation. Required fields are marked REQUIRED: they cover the five must-haves from the test above, plus a few the form itself depends on. Leave optional lines blank to accept a sensible default. Do not move to code until every required field has an answer.

# Agent Spec: <agent-name>

## Identity
Name:        <short name>                   # REQUIRED
Trigger:     Use when …                     # REQUIRED — a condition, not a bio
One liner:   <what it does>                 # "and" = two agents
Owner:       <team or person>

## Thinking
Persona:     <its role>
Goal:        <single goal>                  # REQUIRED
Method:      gather → act → verify → repeat
Hard rules:  <always / never>
Inputs:      <how to interpret what it receives>
Output shape: <what a finished answer looks like> # REQUIRED
Tools:       <which + when to reach for each>
Self-check:  <how it verifies before "done">
When stuck:  ask | return partial | stop
Delegates:   <when + what to carry along>   # if it orchestrates
Approval:    <checkpoint + what you sign off | none>

## Contract
Inputs:      <what it must receive to start>  # REQUIRED
Output:      structured result | finished document  # REQUIRED
  Shape:     <fields or description>
Done when:   <objective gate, not "looks good">  # REQUIRED

## Compute
Model:       deepest | balanced | fastest
Effort:      low | medium | high | max
Reasoning:   yes | no

## Permissions
Tools:       <list | all standard | none>
External:    <systems it can reach | none>
Autonomy:    ask first | auto | plan only | pre-approved
Guard:       <what to check dynamically | none>

## Limits
Max turns:   <number>       # REQUIRED for unsupervised agents
Timeout:     <time>
Sandbox:     <files + services it is fenced into>
Hooks:       before tool call | after tool call | none

## Execution
Start:       standalone | as a subagent | as main agent
Waiting:     someone waits | background
Sessions:    none | persist | resume where it left off
Streaming:   yes | no

## Proof
Progress:    <what it reports while running>
Cost track:  yes | no
Eval set:    <where examples live>          # REQUIRED
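
A filled form can be checked mechanically before anyone writes implementation. The sketch below is one rough way to do that, not a prescribed tool: it assumes the spec is kept as plain text in the layout above, and the function name `blank_required_fields`, the `REQUIRED` list, and the example agent are all hypothetical. It treats an empty value, an untouched `<placeholder>`, or a value ending in "…" as blank, and it skips the conditional "Max turns" field.

```python
import re

# Required (section, label) pairs, taken from the REQUIRED markers in the form.
# "Max turns" is left out because it is required only for unsupervised agents.
REQUIRED = [
    ("Identity", "Name"), ("Identity", "Trigger"),
    ("Thinking", "Goal"), ("Thinking", "Output shape"),
    ("Contract", "Inputs"), ("Contract", "Output"), ("Contract", "Done when"),
    ("Proof", "Eval set"),
]

def blank_required_fields(spec_text: str) -> list[str]:
    """Return 'Section/Label' for every required field that is still blank."""
    section = None
    filled = set()
    for raw in spec_text.splitlines():
        if raw.startswith("## "):
            section = raw[3:].strip()
            continue
        line = raw.split("#", 1)[0]  # drop trailing "# REQUIRED" style comments
        match = re.match(r"\s*([A-Za-z][A-Za-z ]*?):\s*(.*)", line)
        if not match:
            continue
        label, value = match.group(1), match.group(2).strip()
        is_placeholder = re.fullmatch(r"<[^>]*>", value) or value.endswith("…")
        if value and not is_placeholder:
            filled.add((section, label))
    return [f"{s}/{l}" for (s, l) in REQUIRED if (s, l) not in filled]

# Hypothetical, partly filled spec for an invoice extractor.
example = """\
## Identity
Name:        invoice-extractor
Trigger:     Use when a raw invoice needs structured fields pulled
## Thinking
Goal:        <single goal>
Output shape: total, currency, due_date
## Contract
Inputs:      document_id, raw_text
Output:      structured result
Done when:
## Proof
Eval set:    evals/invoice-extractor/
"""

print(blank_required_fields(example))  # ['Thinking/Goal', 'Contract/Done when']
```

Run as a pre-merge check, a non-empty list is the signal to stop and finish the spec.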

Step 2 — Read the gaps before you ship

A filled spec does two things: it defines the agent, and it shows you what you do not yet know about it. The required fields are not bureaucracy. Each one is load-bearing.

When should it be used — in an orchestrated system, this is the only signal the router uses to decide which agent gets the work. It reads this field, nothing else. If it is written as a description ("this agent handles data extraction") rather than a trigger ("use when a raw document needs structured fields pulled before any downstream step can run"), the router cannot reliably choose it — and will make arbitrary calls under ambiguous conditions. This is the most underrated field in any agent definition.

What it needs to start — no agent is truly independent without a declared input. If the answer is "whatever the previous step passes," you have a dependency masquerading as modularity. Name the fields. Type them. State what happens when one is missing.
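
As an illustration of "name the fields, type them, state what happens when one is missing", here is a minimal sketch using only the standard library. The agent, its field names (`document_id`, `raw_text`, `max_chars`), and the `MissingInput` error are assumptions made up for the example.

```python
from dataclasses import dataclass

class MissingInput(ValueError):
    """Raised when a declared input is absent, so the failure is explicit."""

@dataclass(frozen=True)
class ExtractorInput:
    document_id: str          # which document the text came from
    raw_text: str             # the text to extract from
    max_chars: int = 20_000   # bound on input size (hypothetical default)

def parse_input(payload: dict) -> ExtractorInput:
    """Validate a raw payload against the declared input contract."""
    for name in ("document_id", "raw_text"):
        value = payload.get(name)
        if not isinstance(value, str) or not value.strip():
            raise MissingInput(f"required input '{name}' is missing or empty")
    limit = payload.get("max_chars", 20_000)
    if len(payload["raw_text"]) > limit:
        raise ValueError(f"raw_text exceeds the declared bound of {limit} characters")
    return ExtractorInput(payload["document_id"], payload["raw_text"], limit)

print(parse_input({"document_id": "doc-1", "raw_text": "Invoice total: 120 EUR"}))
```

The point of the design is that a missing field fails at the door with a named error, rather than surfacing later as a confused model response.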

What it gives back, and in what shape — if another agent consumes the result, the output must be structured: typed fields, consistent shape, no prose. If a human consumes it, prose is fine. These are fundamentally different contracts, and conflating them breaks pipelines silently, in ways that only surface under real load.

How we know it is done — "looks good" is not a finish line. A done-condition is an objective gate: a required field is populated, a check passes, a count reaches a threshold. Without it, agents run until they hit a limit you did not plan around.
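
A done-condition of this kind can be written as a plain function that returns both a verdict and the reasons it failed. The sketch below assumes a hypothetical extractor whose result needs `total`, `currency`, and at least one entry in `line_items`. The loop around it shows the planned limit the paragraph refers to, and `fake_step` stands in for a real agent turn.

```python
def is_done(result: dict, min_line_items: int = 1) -> tuple[bool, list[str]]:
    """Objective gate: required fields populated and a count threshold reached."""
    unmet = [f"{name} is empty" for name in ("total", "currency")
             if result.get(name) in (None, "")]
    if len(result.get("line_items", [])) < min_line_items:
        unmet.append(f"fewer than {min_line_items} line items")
    return (not unmet, unmet)

def run_until_done(step, max_turns: int = 5) -> dict:
    """Call step(result) until the gate passes or the turn limit is reached."""
    result: dict = {}
    unmet = ["not started"]
    for turn in range(1, max_turns + 1):
        result = step(result)
        done, unmet = is_done(result)
        if done:
            return {"status": "done", "turns": turn, "result": result}
    return {"status": "max_turns", "turns": max_turns, "result": result, "unmet": unmet}

def fake_step(result: dict) -> dict:
    """Stand-in for one agent turn: fills one more field on each call."""
    filled = dict(result)
    for key, value in (("total", 120.0), ("currency", "EUR"),
                       ("line_items", [{"sku": "A1"}])):
        if key not in filled:
            filled[key] = value
            break
    return filled

print(run_until_done(fake_step)["status"])               # done
print(run_until_done(fake_step, max_turns=2)["unmet"])   # ['fewer than 1 line items']
```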

Test cases — the difference between an agent and a reliable agent is a fixed set of inputs with expected outputs you can run against every change to the prompt or tools. Without an eval set, you find out what broke in production, while someone is waiting.


Step 3 — Check the routing, not just the behavior

Most teams write a system prompt and call it done. But in a multi-agent system there are two entirely separate things to write, and they serve different readers.

The trigger description — the "when should it be used" field — is read by the orchestrator. It decides routing. Phrase it as a trigger condition. Start with "Use when." It should tell the system exactly when to hand work here, and implicitly when not to.

The prompt — what the agent knows, how it reasons, what it produces — is read by the model at runtime. This is where the persona, methodology, and constraints live.

These two things look like they overlap. They do not. An agent whose trigger description is vague will be called at the wrong time, no matter how good its prompt is. Route first, behave second. If you cannot write the "Use when" sentence, the agent's scope is not clear enough to exist yet.
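
One way to keep the two texts separate in practice is to build the orchestrator's view from trigger fields alone. The sketch below assumes agents are stored as dictionaries with `trigger` and `prompt` keys. The helper names and the lint rule (a trigger must start with "Use when" and say something after it) are illustrative choices, not a standard.

```python
def lint_trigger(trigger: str) -> list[str]:
    """Return problems that would make a trigger hard to route on."""
    problems = []
    text = trigger.strip()
    if not text.lower().startswith("use when"):
        problems.append('does not start with "Use when"')
    elif len(text.split()) <= 3:
        problems.append("states no condition after 'Use when'")
    return problems

def routing_menu(agents: dict[str, dict]) -> str:
    """Build the text the orchestrator sees: names and triggers, never prompts."""
    lines = []
    for name, spec in sorted(agents.items()):
        problems = lint_trigger(spec["trigger"])
        if problems:
            raise ValueError(f"{name}: trigger {', '.join(problems)}")
        lines.append(f"- {name}: {spec['trigger']}")
    return "\n".join(lines)

agents = {
    "extractor": {
        "trigger": "Use when a raw document needs structured fields pulled "
                   "before any downstream step can run",
        "prompt": "You are a careful extraction specialist ...",  # runtime only
    },
}
print(routing_menu(agents))
print(lint_trigger("This agent handles data extraction"))
```

Because the prompt never enters the menu, a vague trigger has nowhere to hide behind a well-written prompt.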


Step 4 — Test the subagent handoff in isolation

If your system has agents that spawn other agents, there is one rule that changes everything else about how you design the handoffs:

Subagents start in a completely fresh, isolated context.

They do not inherit the parent's conversation. They cannot see prior tool results. They do not know what the orchestrator knows. The parent's world ends at the edge of the kickoff prompt.

This has three concrete consequences.

The only way data reaches a subagent is the kickoff prompt string. Everything the subagent needs must be packed in explicitly — IDs, prior decisions, relevant excerpts, error messages from earlier steps. Assume the subagent wakes up knowing nothing about the world.

The only thing that returns to the parent is the subagent's final message. Intermediate reasoning, tool calls, partial results — all of it stays inside the subagent and is discarded. The output contract must put everything the parent needs into that one final message. If the parent expects structured fields, the subagent must produce them — not prose, not a summary, not an apology.

Parent and child share no memory unless you route it through files or a persisted session. If two subagents need to coordinate, the orchestrator is responsible for passing the context. There is no other channel.

A subagent is a clean room. One door in, one door out. Design the handoff accordingly.

Most multi-agent bugs trace back to one of these three consequences. A subagent that "should have known" something the parent knew, but did not, because it was never packed in. A parent that expected structured output and received a paragraph, because the output contract was never written. Two subagents that contradicted each other, because the orchestrator never passed one's result to the other.

To check your system: take each parent-to-subagent handoff and ask whether the subagent could do its job if the parent had never existed. If the answer is no, something is missing from the kickoff prompt.
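
The "one door in, one door out" rule can be sketched as two small functions: one packs everything into the kickoff string, the other refuses a final message that does not match the output contract. JSON is an assumption made for this example. The function names, the ticket scenario, and the field names are hypothetical.

```python
import json

def build_kickoff(task: str, context: dict, output_fields: list[str]) -> str:
    """Pack the task, all needed context, and the output contract into one string."""
    return "\n".join([
        f"Task: {task}",
        "Context (this is everything you know; nothing else is available):",
        json.dumps(context, indent=2, sort_keys=True),
        "Reply with one JSON object containing exactly these keys: "
        + ", ".join(output_fields),
    ])

def parse_final(message: str, output_fields: list[str]) -> dict:
    """Accept the subagent's final message only if it honors the contract."""
    try:
        data = json.loads(message)
    except json.JSONDecodeError as exc:
        raise ValueError("final message is not JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("final message is not a JSON object")
    missing = [name for name in output_fields if name not in data]
    if missing:
        raise ValueError(f"final message is missing fields: {missing}")
    return data

fields = ["status", "summary"]
kickoff = build_kickoff(
    "Summarize the ticket",
    {"ticket_id": "T-42", "prior_decision": "refund approved",
     "last_error": "timeout contacting billing"},
    fields,
)
print(kickoff)
print(parse_final('{"status": "ok", "summary": "Refund issued."}', fields))
```

Reading the kickoff string on its own, with no other knowledge of the parent, is a quick manual version of the check described above.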


Step 5 — Count the permission layers

Tool access in an agent system is controlled by stacked layers, from coarse to fine. Most teams only think about one of them.

The first layer is the menu: which tools exist for this agent at all. The second is the wall: tools removed from context entirely, which the model cannot see or call. The third is the posture, the global stance: does the agent ask before acting, act automatically, plan only, or run with everything pre-approved?

These three layers are static. They decide based on the tool's name or the agent's global mode. They cannot inspect what the agent is actually attempting to do with a given call.

The fourth layer is a dynamic guard — a callback that runs before every tool call and inspects the real arguments. This is where you enforce rules like "the agent may use the file tool, but only on paths inside its working directory" or "the agent may search, but not on restricted domains." The static layers cannot express these rules; only the callback can, because it sees the actual call, not just the tool name.
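
A guard of this kind might look like the sketch below. How the callback is registered depends on the agent framework in use, so only the check itself is shown. The tool names (`write_file`, `search`), the argument keys, and the restricted domain are assumptions for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

RESTRICTED_DOMAINS = {"internal.example.com"}  # hypothetical deny list

def guard(tool_name: str, args: dict, workdir: str) -> tuple[bool, str]:
    """Inspect the real arguments of a tool call; return (allowed, reason)."""
    if tool_name == "write_file":
        root = Path(workdir).resolve()
        # Resolving collapses "..", so escapes from the working directory are caught.
        target = (root / args.get("path", "")).resolve()
        if not target.is_relative_to(root):
            return False, f"path escapes working directory: {target}"
    if tool_name == "search":
        host = urlparse(args.get("url", "")).hostname or ""
        if any(host == d or host.endswith("." + d) for d in RESTRICTED_DOMAINS):
            return False, f"domain is restricted: {host}"
    return True, "ok"

work = "/tmp/agent-work"
print(guard("write_file", {"path": "notes/out.txt"}, work))
print(guard("write_file", {"path": "../../etc/passwd"}, work))
print(guard("search", {"url": "https://internal.example.com/wiki"}, work))
```

Both tools here would pass the static layers, since the agent is allowed to use them. Only the argument check separates the first call from the second.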

The failure mode is treating the first three layers as sufficient and discovering in production that a correctly-scoped agent did something technically permitted but contextually wrong. The guard is what closes that gap.

For each agent in your system: state what it can do, state what it explicitly cannot, and state what must be checked dynamically based on what it is actually trying to do. If you cannot state the third, the agent's permission boundary is not fully drawn.


Step 6 — Give every agent its own eval set

Testing the pipeline end-to-end tells you whether the system produced the right answer. It does not tell you which agent was responsible when it did not. In a five-agent pipeline, an end-to-end failure is a five-way alibi. You need a test harness per agent, not just per system.

For each agent, define a fixed set of inputs, each paired with either an expected output or a judgment function. There are three approaches, listed roughly in order of preference:

Fixed input → expected output. The strongest form. You run a known input through the agent and compare the result against a canonical answer field-by-field. Works well for agents with structured, deterministic output: an extractor that should return a specific set of typed fields, a classifier that should return a specific label, a formatter that should follow a specific shape. When the prompt or tools change, you run the suite and find out immediately which fields regressed. This is the approach you want for any agent whose output feeds another agent, because the downstream consumer depends on that structure.
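
A field-by-field comparison needs very little machinery. In the sketch below the agent is any callable from input dict to output dict. The case format, the helper names, and the stub extractor with its deliberate currency regression are hypothetical.

```python
MISSING = "<missing>"  # sentinel for a field present on only one side

def diff_fields(expected: dict, actual: dict) -> dict[str, tuple]:
    """Map each differing field to its (expected, actual) pair."""
    diffs = {}
    for key in sorted(expected.keys() | actual.keys()):
        exp, act = expected.get(key, MISSING), actual.get(key, MISSING)
        if exp != act:
            diffs[key] = (exp, act)
    return diffs

def run_suite(agent, cases: list[dict]) -> dict[str, dict]:
    """Run every fixture through the agent; return failures keyed by case id."""
    failures = {}
    for case in cases:
        diffs = diff_fields(case["expected"], agent(case["input"]))
        if diffs:
            failures[case["id"]] = diffs
    return failures

def stub_extractor(payload: dict) -> dict:
    """Stand-in agent that always reports EUR, to show a regression."""
    return {"total": payload["amount"], "currency": "EUR"}

cases = [
    {"id": "eur-invoice", "input": {"amount": 120.0},
     "expected": {"total": 120.0, "currency": "EUR"}},
    {"id": "usd-invoice", "input": {"amount": 80.0},
     "expected": {"total": 80.0, "currency": "USD"}},
]
print(run_suite(stub_extractor, cases))
# {'usd-invoice': {'currency': ('USD', 'EUR')}}
```

The report names the case and the field, which is what makes a regression actionable after a prompt change.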

Heuristic function. When the output is not fully deterministic but still has properties you can check programmatically: response length is within a range, required fields are present and non-empty, no field contains a hallucinated identifier format, a citation points to a real source, a plan contains at least one step. Heuristics do not verify quality — they verify shape and sanity. They catch the obvious failures fast without needing a human or another model in the loop. Write them for any property of the output that you can express as a function, and run them on every eval case.
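
Heuristics of this sort are ordinary predicates collected in a list. The sketch below assumes an output with `summary`, `order_id`, and `plan` fields. The length bounds and the `ORD-` identifier pattern are made-up examples of properties one might check.

```python
import re

ID_PATTERN = re.compile(r"ORD-\d{6}")  # hypothetical identifier format

def length_in_range(out: dict) -> bool:
    return 20 <= len(out.get("summary", "")) <= 2000

def required_fields_present(out: dict) -> bool:
    return all(str(out.get(key, "")).strip() for key in ("summary", "order_id"))

def id_format_valid(out: dict) -> bool:
    return bool(ID_PATTERN.fullmatch(str(out.get("order_id", ""))))

def plan_has_steps(out: dict) -> bool:
    return len(out.get("plan", [])) >= 1

HEURISTICS = [length_in_range, required_fields_present, id_format_valid, plan_has_steps]

def failed_heuristics(out: dict) -> list[str]:
    """Return the names of all checks the output fails."""
    return [check.__name__ for check in HEURISTICS if not check(out)]

good = {"summary": "Customer asked for a refund on a damaged item.",
        "order_id": "ORD-123456", "plan": ["issue refund"]}
bad = {"summary": "ok", "order_id": "12345", "plan": []}

print(failed_heuristics(good))  # []
print(failed_heuristics(bad))   # ['length_in_range', 'id_format_valid', 'plan_has_steps']
```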

LLM-as-judge. When the output is prose or open-ended and its quality cannot be reduced to a deterministic check, a second model evaluates it against explicit criteria. The criteria are the key part — not "is this good?" but "does this answer cite evidence for every claim?", "does this plan contain steps that contradict each other?", "does this summary accurately represent the source without adding information that was not there?" The judge model needs the same input the agent received, the agent's output, and a rubric. Without a rubric, you are asking for an opinion. With one, you are running a measurement.
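
The sketch below shows the three ingredients (input, output, rubric) assembled into a judge prompt, with one yes/no verdict parsed per criterion. The model call is injected as a plain callable, `ask_model`, so no particular provider API is assumed. The stub used here just returns a canned reply. The criteria are phrased so that YES always means pass, which is an adaptation of the questions above.

```python
import re
from typing import Callable

RUBRIC = [
    "Does the answer cite evidence for every claim?",
    "Is the plan free of steps that contradict each other?",
    "Does the summary avoid adding information absent from the source?",
]

def build_judge_prompt(agent_input: str, agent_output: str, rubric: list[str]) -> str:
    """Give the judge the same input the agent saw, its output, and the rubric."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(rubric, 1))
    return (
        f"INPUT:\n{agent_input}\n\nOUTPUT:\n{agent_output}\n\n"
        f"RUBRIC:\n{numbered}\n\n"
        "Answer each criterion on its own line as '<number>: YES' or '<number>: NO'."
    )

def judge(agent_input: str, agent_output: str, rubric: list[str],
          ask_model: Callable[[str], str]) -> dict[int, bool]:
    """Return {criterion number: passed}; fail loudly if a verdict is missing."""
    reply = ask_model(build_judge_prompt(agent_input, agent_output, rubric))
    verdicts = {}
    for line in reply.splitlines():
        match = re.match(r"\s*(\d+)\s*[:.]\s*(YES|NO)\b", line, re.IGNORECASE)
        if match:
            verdicts[int(match.group(1))] = match.group(2).upper() == "YES"
    missing = [i for i in range(1, len(rubric) + 1) if i not in verdicts]
    if missing:
        raise ValueError(f"judge gave no verdict for criteria: {missing}")
    return verdicts

def stub_model(prompt: str) -> str:
    """Canned reply standing in for a real judge model."""
    return "1: YES\n2: NO\n3: YES"

print(judge("source text", "agent answer", RUBRIC, stub_model))
# {1: True, 2: False, 3: True}
```

Per-criterion verdicts keep the result a measurement: a regression shows up as a specific criterion flipping, not as a lower overall score.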

The difference between an agent and a reliable agent is a regression suite. The difference between a regression suite and a useful one is that it runs on every change.

A few things that follow from this:

Eval sets belong to the agent, not the system. Each agent has its own input fixtures and judgment logic, maintained alongside its prompt. When the prompt changes, the eval suite runs. When the suite passes and the system still fails, you know the problem is in the handoff — not the agent.

The judgment method should match the output type. Structured output → fixed comparison. Constrained prose → heuristics first, LLM-as-judge for the remainder. Open-ended prose → LLM-as-judge with an explicit rubric. Mixing these up wastes effort: running an LLM judge on a field that could be checked with a regex is slow and introduces variance where none is needed.

The eval set is the spec made executable. Every required field in the spec should have a test case that proves it is being met. The "Done when" field becomes the pass condition for a real fixture. The "Output shape" field becomes the shape you check against. If you have written the spec and cannot write at least three eval cases that follow directly from it, the spec is too vague.


What a correct system looks like from the outside

A multi-agent system built correctly is one where every agent can be handed to someone who did not build it, and they can answer — without reading the implementation — what it does, what it needs, what it returns, and how you would know if it were broken.

The spec form is how you get there. Not written after the agents are built, as documentation that will decay the moment the next sprint begins. Written first, as the design surface where you discover the questions you have not yet answered. The implementation is what you write once the spec has no blanks left in required positions.

Empty required fields are not placeholders. They are the shape of your next production incident, written in white space, visible now, while the cost of filling them is still low.

Write the spec. Read the gaps. Then build the agent.