Is My Multi-Agent Architecture Well Designed? A Step-by-Step Audit
Most multi-agent systems aren't designed badly from the start. They drift. An agent gets added in a sprint without a spec. A handoff that worked on day one silently breaks when the upstream agent changes its output format. A trigger description that was precise enough last month is now ambiguous enough that the router picks the wrong agent under load. Nobody notices until a user sees it.
This post is for auditing an existing system, not building a new one. Run each step against every agent you have. The ones that fail are your production incidents waiting to happen.
Step 1 — Run the independence test on every agent
What to check. For each agent in your system, answer these five questions without opening any source file:
- When exactly should this agent be used: what condition triggers it and no other agent?
- What does it receive to start — every required field, named and typed?
- What does it return — the exact shape, not "a result"?
- How do you know it finished successfully — the objective gate?
- Where are the test cases — the actual files you can run right now?
Signs of trouble. You have to open files to answer any of these. You start a sentence with "it kind of handles…" or "the orchestrator figures that out." Multiple people give different answers to the same question.
The fix. For every agent that fails this test, write the spec before touching the code. Not after — the spec is how you discover what you actually don't know about the agent yet. Start with the hardest one: the agent nobody can explain clearly is almost always the one causing the most problems in production.
The agent nobody can explain is the one production understands best — at your expense.
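One way to make the five questions checkable is to write each agent's answers into a small record and list whatever is still blank. The sketch below is only an illustration: the `AgentSpec` layout, the `spec_gaps` helper, and the example ticket agent are assumptions made for this post, not part of any framework.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentSpec:
    """Answers to the five independence questions (hypothetical layout)."""
    trigger: str            # when exactly this agent should be used
    inputs: dict[str, str]  # required field name -> type name
    output: dict[str, str]  # returned field name -> type name
    done_when: str          # the objective success gate
    fixtures: list[str]     # paths to runnable test cases


def spec_gaps(spec: AgentSpec) -> list[str]:
    """Return the names of spec fields that are still empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


# A half-written spec for an invented ticket-categorizing agent.
draft = AgentSpec(
    trigger="Use when a ticket has no assigned category.",
    inputs={"ticket_text": "str"},
    output={},
    done_when="",
    fixtures=[],
)
print(spec_gaps(draft))  # ['output', 'done_when', 'fixtures']
```

An agent passes Step 1 when `spec_gaps` returns an empty list and the people on the team agree with what was written.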
Step 2 — Search every agent definition for the word "and"
What to check. Find the one-sentence description of what each agent does. Search it literally for the word "and." Do the same for the agent's stated goal, its trigger condition, and any summary of its responsibilities. Count how many times "and" appears.
Signs of trouble. "It researches the subject and drafts the recommendations." "It validates the input and formats the output for the next step." "It calls the API, parses the response, and writes the result to the store." Each "and" is a seam between two responsibilities that someone decided to keep in one place. That decision felt convenient at the time. It becomes a problem when one responsibility needs to change without disturbing the other — which is almost always.
An agent with two responsibilities has two reasons to fail. It has two sets of edge cases to handle. It has two things that can drift independently when the system around it changes. And critically, it has two things that need to be tested — but because they live in one agent, they tend to get tested together, which means neither is tested well.
The fix. Every "and" in an agent's description is a split candidate. For each one, ask: could these two things run independently? Could one succeed while the other fails? Could someone else in the system need just one of them, not both? If the answer to any of those is yes, the agent should be two agents.
The mechanical test: try writing a one-sentence description of each half separately. If both sentences make sense as independent agents — each with its own input, output, and done-condition — the split is correct. If one of them cannot stand alone, it is not a true separation; it is just moving the seam. In that case, rethink where the boundary belongs, or accept that the coupling is real and document it explicitly rather than hiding it inside one agent's responsibilities.
If you need "and" to describe what an agent does, you have described two agents. Build two agents.
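The literal search can be scripted. This is a minimal sketch that assumes the one-sentence descriptions are already collected in a dictionary; the agent names and the `split_candidates` helper are made up for illustration. A comma-separated list can hide more seams than the count shows, so treat the number as a lower bound.

```python
import re

# Match "and" only as a standalone word, so "standard" or "candidate" do not count.
AND = re.compile(r"\band\b", re.IGNORECASE)


def split_candidates(descriptions: dict[str, str]) -> dict[str, int]:
    """Map agent name -> number of standalone 'and's in its description."""
    counts = {name: len(AND.findall(text)) for name, text in descriptions.items()}
    return {name: n for name, n in counts.items() if n > 0}


descriptions = {
    "researcher": "Researches the subject and drafts the recommendations.",
    "formatter": "Formats validated records using the standard layout.",
    "fetcher": "Calls the API, parses the response, and writes the result to the store.",
}
print(split_candidates(descriptions))  # {'researcher': 1, 'fetcher': 1}
```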
Step 3 — Audit every trigger description
What to check. Collect the description or routing condition for every agent. Read each one and ask: could a router use this to reliably choose this agent and no other? Then ask: could it tell when NOT to use this agent?
Signs of trouble. The description reads like a job title: "handles research tasks," "manages the output phase," "responsible for data extraction." These are biographies. A router matching on biography language will call the wrong agent in any edge case — and edge cases are the majority of real traffic.
The fix. Rewrite every description as a trigger condition. Start with "Use when." End the sentence with a specific situation that would make this agent the right choice and implicitly exclude every other agent. If you cannot write that sentence without also describing another agent's territory, the two agents overlap and need boundaries.
A useful test: write two agent descriptions side by side. If a random input could plausibly match both, the routing is ambiguous. Pick a representative edge case and trace which agent should win. If the answer is "it depends on context the router cannot see," you have a design problem — not a configuration problem.
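The side-by-side test can be roughed out in code if each trigger condition is restated as a predicate over a request. That is a simplifying assumption: real routers often match on description text rather than on explicit predicates. The agent names and request fields below are hypothetical.

```python
from typing import Callable

# Hypothetical triggers: each agent's "Use when ..." condition as a predicate.
TRIGGERS: dict[str, Callable[[dict], bool]] = {
    "invoice_extractor": lambda r: r["kind"] == "pdf" and r["has_invoice_number"],
    "document_summarizer": lambda r: r["kind"] == "pdf",
}


def ambiguous_inputs(triggers, samples):
    """Return (sample, matching agents) for each sample that matches more than one trigger."""
    clashes = []
    for sample in samples:
        winners = [name for name, matches in triggers.items() if matches(sample)]
        if len(winners) > 1:
            clashes.append((sample, winners))
    return clashes


samples = [
    {"kind": "pdf", "has_invoice_number": True},
    {"kind": "pdf", "has_invoice_number": False},
]
for sample, winners in ambiguous_inputs(TRIGGERS, samples):
    print(sample, "->", winners)
```

Here the first sample matches both agents, which means the summarizer's trigger needs an explicit exclusion such as "and the PDF is not an invoice".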
Step 4 — Audit every subagent kickoff
What to check. For each parent-to-subagent handoff, list every piece of data the subagent uses to do its job. Then check: is every item on that list explicitly packed into the kickoff prompt? Or does the subagent rely on context it would inherit "automatically"?
Signs of trouble. The subagent's output changes depending on what happened earlier in the session — but it has no way to know what happened earlier, because it starts fresh. Silent failures that only appear when the orchestrator changes how it formats something upstream. A subagent whose prompt says "based on the previous analysis" but has no way to receive that analysis.
Subagents do not inherit the parent's conversation, tool results, or reasoning. They receive exactly what is in their kickoff prompt string, and nothing else. This is not a limitation to work around — it is the isolation guarantee that makes subagents composable. The cost is that the kickoff must be self-contained.
The fix. For every handoff, build a kickoff checklist: a list of every field the subagent needs, where it comes from in the parent, and how it is serialized into the kickoff string. If a field is missing from the parent's output contract, the parent cannot pass it — and now you know you need to add it.
If the subagent could not do its job starting from a blank session with only the kickoff prompt, something is missing from the kickoff.
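A kickoff checklist can double as the code that builds the kickoff string. The sketch below assumes the parent's output is a dictionary and serializes the checklist fields as JSON; the field names, the `build_kickoff` helper, and the wording of the preamble are all illustrative.

```python
import json

# Hypothetical checklist: field the subagent needs -> key in the parent's output.
KICKOFF_CHECKLIST = {
    "goal": "task_goal",
    "prior_analysis": "analysis_summary",
    "output_format": "expected_schema",
}


def build_kickoff(parent_output: dict, checklist: dict[str, str]) -> str:
    """Serialize every checklist field into one self-contained kickoff string."""
    missing = [src for src in checklist.values() if src not in parent_output]
    if missing:
        # The parent's output contract lacks something the subagent needs.
        raise KeyError(f"parent output is missing: {missing}")
    payload = {field: parent_output[src] for field, src in checklist.items()}
    preamble = "You start with no prior context. Everything you need is below.\n"
    return preamble + json.dumps(payload, indent=2)


parent_output = {
    "task_goal": "Summarize churn drivers",
    "analysis_summary": "Churn is concentrated in month two.",
    "expected_schema": "summary: str, drivers: list[str]",
}
print(build_kickoff(parent_output, KICKOFF_CHECKLIST))
```

Failing loudly on a missing field turns a silent handoff bug into an error at the moment the parent's contract falls short.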
Step 5 — Audit every output contract
What to check. For each agent, look at what it actually returns. Is there a declared schema — named fields, types, clear structure? Then look at how the consumer uses the output. Does it reference named fields, or does it parse free text?
Signs of trouble. The consuming agent's prompt contains phrases like "interpret whatever the previous step returned," "extract the relevant parts from the response," or "use your judgment about the format." Code around the handoff has defensive fallbacks for missing fields. The output format changed once and the consumer silently started returning wrong answers.
The fix. Every agent whose output feeds another agent must have a typed output schema with named fields. Prose output is only acceptable when the consumer is a human — anything a model reads must be structured. Write the schema as part of the spec, not as a comment in the consumer's prompt.
If you cannot declare the schema because "the output varies depending on the task," that is not a flexible design — it is an undeclared contract. A contract that is not written down is a contract that will be violated silently.
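A declared schema is only useful if something checks it at the handoff. As a rough sketch using nothing but the standard library, the contract below is a mapping from field name to type, and `contract_violations` reports every mismatch. The research agent's fields are invented for the example, and a real system might prefer a schema library.

```python
# Hypothetical output contract for a research agent: field name -> required type.
RESEARCH_OUTPUT = {"summary": str, "sources": list, "confidence": float}


def contract_violations(output: dict, schema: dict[str, type]) -> list[str]:
    """List every way `output` departs from the declared schema."""
    problems = []
    for name, expected in schema.items():
        if name not in output:
            problems.append(f"missing field: {name}")
        elif not isinstance(output[name], expected):
            actual = type(output[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Fields the contract never declared are drift too.
    for name in sorted(output.keys() - schema.keys()):
        problems.append(f"undeclared field: {name}")
    return problems


drifted = {"summary": "Churn peaks in month two.", "sources": "report.pdf, crm export"}
print(contract_violations(drifted, RESEARCH_OUTPUT))
# ['sources: expected list, got str', 'missing field: confidence']
```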
Step 6 — Audit the permission envelope of each agent
What to check. For each agent, list: what tools does it have access to? What is it explicitly blocked from? What requires a dynamic check based on the actual arguments of each call?
Signs of trouble. The answer to "what can it do" is "all the standard tools" or "same permissions as the other agents." An agent that reads files also has access to execute shell commands. An agent that searches the web also has write access to the workspace. Permissions were never narrowed from the default because narrowing felt like extra work.
The failure mode is that a well-behaved agent, one doing exactly what it was told, does something technically permitted but contextually wrong. A research agent reads a file it should not have accessed. An agent makes an external call that was supposed to be blocked in this context.
The fix. Define the minimum viable tool set for each agent. List what it can use. List what it is explicitly denied. Then identify anything that requires a runtime check — not based on the tool name, but based on the actual arguments. "The file tool is allowed, but only on paths inside the agent's working directory" cannot be expressed as a deny list of tool names. It requires a callback that inspects the actual call before it runs.
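The working-directory rule might look like the sketch below: an allow list, a deny list, and a callback that inspects the arguments. The tool names and the `may_call` signature are assumptions for illustration; how a callback like this gets wired into each tool call depends on the framework you use.

```python
from pathlib import Path

# Hypothetical tool names for a research agent.
ALLOWED_TOOLS = {"read_file", "search_web"}
DENIED_TOOLS = {"run_shell", "write_file"}


def may_call(tool: str, args: dict, workdir: Path) -> bool:
    """Runtime check run before each tool call; False means block the call."""
    if tool in DENIED_TOOLS or tool not in ALLOWED_TOOLS:
        return False
    if tool == "read_file":
        # Resolve ".." segments and absolute paths before comparing locations.
        root = workdir.resolve()
        target = (root / args["path"]).resolve()
        return target.is_relative_to(root)  # Path.is_relative_to needs Python 3.9+
    return True


workdir = Path("/tmp/agent_workspace")
print(may_call("read_file", {"path": "notes/summary.txt"}, workdir))  # True
print(may_call("read_file", {"path": "../secrets.txt"}, workdir))     # False
print(may_call("run_shell", {"cmd": "ls"}, workdir))                  # False
```

Anything not on the allow list is refused by default, so adding a new tool to the system does not silently widen this agent's envelope.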
Step 7 — Audit every unsupervised agent for a turn limit
What to check. Which agents run without a human in the loop? For each one: is there an explicit maximum number of turns before it stops and reports? Is there a cost ceiling on what it can spend in a single run?
Signs of trouble. Agents that theoretically run until they are "done" with no numeric bound. A research agent that can make unlimited web calls. A loop agent whose stop condition exists only in its prompt as a polite suggestion. The first time you see the cost of a single run is in the billing statement.
The fix. Set a turn limit on every agent that runs unsupervised. Pick a number you would be comfortable explaining to a user whose account the run was charged against. Then set a cost ceiling that aborts the run as soon as its spend reaches the ceiling. The ceiling is separate from the turn limit, and both are needed: the turn limit caps runaway loops; the cost ceiling caps runaway tool calls within a turn.
A useful calibration: run the agent on a normal input and note the actual turn count. Set your limit at roughly twice that. If the agent regularly needs more than double its typical count to complete a task, the task is too large for a single agent.
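Both bounds can live in the loop that drives the agent rather than in its prompt. The sketch below assumes a hypothetical `step` callable that runs one turn and reports whether the agent is done and what the turn cost. It checks spend only between turns, so a real implementation would likely also check after each tool call.

```python
class BudgetExceeded(Exception):
    """Raised when a run hits its turn limit or cost ceiling."""


def run_bounded(step, max_turns: int, max_cost_usd: float):
    """Call step() until it reports done, or stop when a limit is hit.

    `step` is a hypothetical callable returning (done, cost_usd) for one turn.
    Returns (turns_used, total_cost_usd) on success.
    """
    spent = 0.0
    for turn in range(1, max_turns + 1):
        done, cost = step()
        spent += cost
        if spent > max_cost_usd:
            raise BudgetExceeded(f"cost ceiling exceeded on turn {turn}: ${spent:.2f}")
        if done:
            return turn, spent
    raise BudgetExceeded(f"turn limit of {max_turns} reached without finishing")


# Calibration from the text: observe a normal run, then allow roughly twice that.
typical_turns = 6
max_turns = 2 * typical_turns
```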
Step 8 — Audit the eval coverage
What to check. For each agent: how many fixed input → expected output pairs exist as runnable test fixtures? When did you last run them? Do they cover the happy path, an edge case, and a failure case?
Signs of trouble. Zero fixtures. "We test it manually when something seems off." End-to-end tests that pass after a prompt change, while one agent is silently returning a slightly different shape that a downstream agent is compensating for.
End-to-end tests tell you the pipeline produced an answer. They do not tell you which agent produced a wrong answer that another agent happened to recover from. When that recovery stops working — because you changed a different agent — you will not know which one broke.
The fix. Write three fixtures per agent. Start with the most important case, then the most likely failure mode, then one edge case you have seen in production or can clearly imagine. Each fixture is an input paired with either a specific expected output (structured agents) or an explicit rubric (open-ended agents). Run them on every change to the agent's prompt or tools.
Three is not comprehensive coverage. It is the minimum that converts "I hope this still works" into "I know this still works." Build from there.
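For a structured agent, three fixtures can be as plain as a list of named input and expected-output pairs. In this sketch `classify_ticket` is a stand-in function, not a real agent call, and the fixture contents are invented. Open-ended agents would need a rubric check in place of the equality comparison.

```python
# Hypothetical agent under test: a plain function standing in for a real agent call.
def classify_ticket(text: str) -> dict:
    if not text.strip():
        return {"status": "error", "reason": "empty_input"}
    category = "billing" if "invoice" in text.lower() else "general"
    return {"status": "ok", "category": category}


# (name, input, expected output): most important case, likely failure, one edge case.
FIXTURES = [
    ("most important case", "My invoice total is wrong",
     {"status": "ok", "category": "billing"}),
    ("likely failure mode", "   ",
     {"status": "error", "reason": "empty_input"}),
    ("edge case", "INVOICE???",
     {"status": "ok", "category": "billing"}),
]


def run_fixtures(agent, fixtures):
    """Return the names of fixtures whose actual output differs from the expected one."""
    return [name for name, given, expected in fixtures if agent(given) != expected]


print(run_fixtures(classify_ticket, FIXTURES))  # [] when every fixture passes
```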
Step 9 — Audit the failure contracts
What to check. For each agent: what does it return when it cannot complete its task? Does the orchestrator check for that? What happens to the pipeline when one agent fails?
Signs of trouble. Failures are swallowed: the agent returns an empty result or a generic error string, the orchestrator interprets it as success, and the pipeline continues with bad state. Or the opposite: a single agent failure crashes the entire run with no partial output and no explanation of where things broke.
The fix. Define a failure contract for each agent alongside its output contract. What does it return when it receives bad input? When a tool call fails? When it exhausts its turn limit? The orchestrator must check the return value, not assume success. If you cannot describe how the agent fails — what it returns and what the caller should do — you cannot handle the failure gracefully.
A practical test: intentionally pass the agent a malformed or empty input. Watch what comes back to the orchestrator. If the orchestrator cannot tell the difference between a successful result and a failure result, you have found your next production incident before it found you.
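A failure contract is easier to enforce when success and failure share one declared shape. The sketch below assumes a hypothetical `AgentResult` record and three invented error kinds. The point is only that the orchestrator branches on declared fields instead of guessing from an empty string.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AgentResult:
    """Hypothetical result envelope shared by success and failure."""
    ok: bool
    data: Optional[dict[str, Any]] = None
    error_kind: Optional[str] = None  # e.g. "bad_input", "tool_failed", "turn_limit"
    detail: str = ""


def handle(result: AgentResult) -> str:
    """Orchestrator-side check: pick the next action from the declared result."""
    if result.ok and result.data:
        return "continue"
    if result.ok:
        # "Success" with nothing in it is the swallowed-failure case.
        return "halt: success reported with empty data"
    if result.error_kind == "tool_failed":
        return "retry"
    return f"halt: {result.error_kind}"


print(handle(AgentResult(ok=True, data={"category": "billing"})))  # continue
print(handle(AgentResult(ok=True)))                                # halt: success reported with empty data
print(handle(AgentResult(ok=False, error_kind="bad_input")))       # halt: bad_input
```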
Running the full audit
Go through all nine steps for every agent, not just the ones you suspect. The agents that feel solid are often the ones that have drifted the furthest — they were specified once, worked well, and were never revisited when the system around them changed.
For each failure you find, the fix follows the same pattern: write the thing that should have existed before the agent was built. The spec. The output schema. The kickoff checklist. The eval fixture. The failure contract. These are not overhead. They are the interface. An agent without them is not a component — it is a black box that the rest of your system is hoping stays the same.
Keep a running score as you audit: how many agents pass all nine steps without a gap? For most teams, the answer is fewer than they expect. That gap is the real architecture: not the diagram, but the distance between what was designed and what was shipped.
Architecture that only works when nothing changes is not architecture. It is a bet on staying lucky.
The audit is not a one-time event. Every new agent resets the count. Run it again.