From the Build · 10 min · June 15, 2026

How to Make Your Agent 100× Faster and More Accurate in 5 Steps

Most agents are not slow because the model is slow. They are slow because they reason their way through work that a plain tool could have finished in one call: discovering a URL mid-thought, waiting, deciding what to read next, waiting again. The model is doing a search engine's job, a scraper's job, and a database's job, one expensive turn at a time. Speeding that up is not a prompt-tuning problem. It is a structural one.

This is the method we used to fix it: five steps, in order, each one earning the next. The running example is a research agent we run in production. The diagrams below are from its real logs, anonymized, with the numbers rounded. Treat the "100×" in the title as shorthand for "dramatically," not as a benchmark: on this particular agent the median run fell from about 336 seconds to roughly 53, a little over six times faster, and accuracy went up, not down. The multiple you get is yours to measure. The reasoning behind it is the part that transfers.

This is for an agent you already have and already find too slow, not a greenfield design. And one sentence runs through all five steps: most of the time you don't need an LLM, you need a better tool, and the tool is usually faster and cheaper at the same time.

Step 1: Aggregate the logs and actually read them

What to check. Pull every production run of the agent into one place. Not a sample, all of them. Lay each run out call by call: what it called, in what order, how long each call took, and how long it spent thinking between calls. Then aggregate across runs: median duration, where the time goes, how many calls are redundant, how many simply fail.

What you find. You cannot optimize what you have never looked at, and the aggregate is always more damning than your memory of any single run. Across a few dozen of our runs the picture was not subtle: more than half of the wall-clock time was input/output wait between sequential rounds, a large share of calls returned nothing usable (login walls, dead URLs, empty results), and the single biggest block of time was the model composing its entire output in one pass at the very end. Almost none of the slowness was "thinking."

What to do. Build the aggregate view before you touch anything. Rank the runs by duration and read the slowest one next to the fastest. The gap between them is your real opportunity, and it is almost never the model. It is structure: too many sequential rounds, calls that should never have fired, and one giant write bolted onto the end.
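The aggregate view described above can be sketched in a few lines. This is a minimal illustration, not our production tooling: the `Call` record, its field names, and the `summarize` function are assumptions made up for the example, and your own logs will need mapping into whatever shape you choose.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class Call:
    tool: str             # hypothetical tool name, e.g. "search" or "fetch"
    arg: str              # the query or URL the call was made with
    wait_seconds: float   # time spent waiting on the call itself (I/O)
    think_seconds: float  # model time spent before issuing the call
    ok: bool              # False for login walls, dead URLs, empty results


def summarize(runs: dict[str, list[Call]]) -> dict:
    """Aggregate call-by-call logs into the numbers Step 1 asks for."""
    if not runs or not all(runs.values()):
        raise ValueError("need at least one run, and every run needs calls")

    durations = {
        run_id: sum(c.wait_seconds + c.think_seconds for c in calls)
        for run_id, calls in runs.items()
    }
    calls = [c for run in runs.values() for c in run]
    total = sum(durations.values())

    # A call counts as redundant if the same tool was already called
    # with the same argument earlier in the same run.
    redundant = sum(
        count - 1
        for run in runs.values()
        for count in Counter((c.tool, c.arg) for c in run).values()
    )

    ranked = sorted(durations, key=durations.get)
    return {
        "median_seconds": median(durations.values()),
        "io_wait_share": sum(c.wait_seconds for c in calls) / total if total else 0.0,
        "failure_rate": sum(not c.ok for c in calls) / len(calls),
        "redundant_calls": redundant,
        "fastest_run": ranked[0],
        "slowest_run": ranked[-1],
    }
```

The `fastest_run` and `slowest_run` ids are there so you can open those two runs side by side, which is the comparison that tends to show where the structure went wrong.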

Before you optimize a single call, count them. Half your latency is usually work that never needed to happen.

Step 2: Draw the dependency graph, and look for the human flow

What to check. Take one representative run and map what truly depends on what. For every call, ask: did this need the result of an earlier call, or did it only happen after it by accident of how the model planned? Mark the hard dependencies. Mark the human-in-the-loop gates separately. Those are real waits, but they are not compute.

What you find. Most "agent decisions" turn out to be a fixed pipeline a human would recognize as a checklist: confirm identity, read the site, read the recent posts, check the competitors, synthesize. The agent rediscovers that checklist from scratch on every run, slowly, and in a different order each time. In ours, a dozen searches the model fired one after another had no dependency on each other at all; they were sequential only because the model thought of them in sequence. The one genuine wait in the whole run was a human approval gate, which is a different kind of delay entirely.

Step 2: one real run, call by call. What actually happened: long sequential batches, calls that return nothing, and a single human-approval gate near the end.

What to do. Sort every call into three buckets: genuinely dependent, human gate, and merely sequential out of habit. The last bucket is almost always the largest, and it is free money: work the agent serialized for no reason. If a human would run this as a known checklist, encode it as one, so the agent stops paying to rediscover its own pipeline on every run.
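One way to make the three buckets concrete is to write down, for each step, which earlier outputs it actually consumes. The sketch below assumes you have annotated a run by hand; the `Step` type, the step names, and the dependencies in the example are hypothetical, chosen to mirror the checklist above.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    # Names of earlier steps whose output this step really consumes.
    needs: set = field(default_factory=set)
    human_gate: bool = False


def bucket(steps: list) -> dict:
    """Sort the steps of one observed run into the three buckets."""
    buckets = {"dependent": [], "human_gate": [], "no_hard_dependency": []}
    for step in steps:
        if step.human_gate:
            buckets["human_gate"].append(step.name)
        elif step.needs:
            buckets["dependent"].append(step.name)
        else:
            # Nothing to wait for: if this ran late, it was order of
            # habit. (The first step of the run lands here too.)
            buckets["no_hard_dependency"].append(step.name)
    return buckets


# Hypothetical annotation of one run, in the order the agent executed it.
observed = [
    Step("confirm_identity"),
    Step("read_site", needs={"confirm_identity"}),
    Step("read_recent_posts", needs={"confirm_identity"}),
    Step("search_competitor_a"),
    Step("search_competitor_b"),
    Step("synthesize", needs={"read_site", "read_recent_posts",
                              "search_competitor_a", "search_competitor_b"}),
    Step("approval", needs={"synthesize"}, human_gate=True),
]
```

Calling `bucket(observed)` puts the two competitor searches in `no_hard_dependency`: they ran fourth and fifth only because the model thought of them fourth and fifth.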

If you can write the agent's job as a checklist a new hire could follow, you don't have a reasoning problem. You have a scheduling problem.

Step 3: Turn the graph into a timeline

What to check. Redraw the dependency graph on a time axis. Put every independent piece of work into the earliest wave it could start in. Anything with no dependency on the wave before it moves left, to run in parallel. What remains on the longest path from start to finish is your critical path, the only thing that actually sets total time.

What you find. The work collapses into a handful of parallel waves. The thirty-odd sequential calls from the observed run drop into four waves, with the calls inside each wave running at once, and the total time stops being the sum of the calls and becomes the slowest path through them. Suddenly the question is no longer "how do I make each call faster" but "what is actually on the longest chain", and the answer is usually two or three steps, not thirty.

Step 3: the same work re-planned as a timeline. Overlapping bars run simultaneously; only the longest chain (the critical path) sets the clock.

What to do. Schedule the independent work into parallel waves and measure the critical path. Now you know the real target: you cannot go faster than the longest dependent chain, so that chain is the only thing worth attacking next. Everything off it is already free.
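The scheduling itself is a small graph computation. The sketch below assigns each step to the earliest wave it could start in and walks back along the slowest dependency to recover the critical path. The function name `plan`, the step names, and the durations are illustrative assumptions, not figures from our logs.

```python
def plan(durations: dict, needs: dict):
    """Return (waves, critical_path, critical_seconds) for a dependency graph.

    durations: step name -> seconds the step takes on its own
    needs:     step name -> set of steps whose output it requires
    """
    wave_of, finish, previous = {}, {}, {}

    def visit(name, stack=()):
        if name in wave_of:
            return
        if name in stack:
            raise ValueError(f"dependency cycle through {name!r}")
        deps = needs.get(name, set())
        for dep in deps:
            visit(dep, stack + (name,))
        # Earliest wave: one after the latest wave among its dependencies.
        wave_of[name] = 1 + max((wave_of[d] for d in deps), default=-1)
        # A step can start only when its slowest dependency has finished.
        slowest = max(deps, key=lambda d: finish[d], default=None)
        previous[name] = slowest
        start = finish[slowest] if slowest is not None else 0.0
        finish[name] = start + durations[name]

    for name in durations:
        visit(name)

    waves = [[] for _ in range(max(wave_of.values()) + 1)]
    for name, wave in wave_of.items():
        waves[wave].append(name)
    waves = [sorted(w) for w in waves]

    # Walk back from the step that finishes last to get the critical path.
    path, node = [], max(finish, key=finish.get)
    while node is not None:
        path.append(node)
        node = previous[node]
    return waves, path[::-1], max(finish.values())


# Made-up durations in seconds, for illustration only.
durations = {"identity": 2, "crawl": 8, "posts": 5, "competitors": 6, "synthesize": 20}
needs = {
    "crawl": {"identity"},
    "posts": {"identity"},
    "synthesize": {"crawl", "posts", "competitors"},
}
waves, critical_path, seconds = plan(durations, needs)
```

With these example numbers the steps sum to 41 seconds of work, but the critical path is identity, crawl, synthesize at 30 seconds. Speeding up `posts` or `competitors` would change nothing; only the three steps on the path are worth attacking.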

Wall-clock time is the slowest path, not the total work. Parallelize first; it costs nothing and tells you exactly what to fix next.

Step 4: Attack the critical path by swapping LLM work for tools

What to check. Look at every step on the critical path and ask one question: is a language model the right thing to be doing this job? A model discovering URLs one at a time is a worse, slower, more expensive search engine. A model reading pages one fetch at a time is a worse, slower scraper. For each step, find the deterministic tool that does exactly that job in one call.

What you find. The wins are large and they compound. One site-crawl call replaces a dozen sequential page fetches. One batched search call replaces dozens of one-at-a-time searches. One identity-enrichment lookup replaces an entire discovery round. The trick is to move all of that input/output before the model runs, a pre-flight pass that fans out the deterministic work in parallel and hands the model a finished context package. Now the model's only job is the one thing it is genuinely good at: judgment over assembled facts.
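A pre-flight pass of this kind can be sketched with `asyncio.gather`. Everything tool-shaped here is a stand-in: `crawl_site`, `batch_search`, and `enrich_identity` are hypothetical stubs that sleep instead of calling a real service, and the shape of the context package is an assumption for the example.

```python
import asyncio


# Hypothetical stand-ins for real tools. Each one sleeps to simulate I/O.
async def crawl_site(domain: str) -> list:
    await asyncio.sleep(0.2)
    return [f"https://{domain}/", f"https://{domain}/about"]


async def batch_search(queries: list) -> dict:
    await asyncio.sleep(0.2)
    return {q: [f"result for {q}"] for q in queries}


async def enrich_identity(name: str) -> dict:
    await asyncio.sleep(0.2)
    return {"name": name, "verified": True}


async def preflight(name: str, domain: str, queries: list) -> dict:
    """Fan out the deterministic I/O in parallel, before any model call."""
    results = await asyncio.gather(
        crawl_site(domain),
        batch_search(queries),
        enrich_identity(name),
        return_exceptions=True,  # one failed tool must not sink the others
    )
    keys = ("pages", "search", "identity")
    # A failed tool becomes a missing field that the model can be told about.
    return {
        key: None if isinstance(result, Exception) else result
        for key, result in zip(keys, results)
    }
```

A call such as `asyncio.run(preflight("Example Co", "example.com", ["example co pricing"]))` takes about as long as the slowest single tool, not the sum of the three, and the model then receives one finished package instead of discovering each piece turn by turn.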

And the part people miss: the tools are cheaper than the agent they replace. A handful of predictable API calls costs less than the many model turns and failed calls they stand in for. You are not trading speed for money. The faster design is also the cheaper one, because every model turn you delete was both slow and billed.

Step 4: the rebuilt flow as parallel waves: deterministic tools fan out before the model runs, and a small number of focused model calls finish the job.

What to do. Replace each critical-path step with its tool, fastest-payoff first. Re-measure the critical path after each swap. The bottleneck moves, and you want to chase it, not the step you already fixed. Keep going until the only thing left on the path is the irreducible model work.

A deterministic tool beats a model at the model's worst job (fetching, searching, looking things up) on speed and cost. Reach for the tool first; reach for the model only for judgment.

Step 5: Evaluate the new flow against the old one

What to check. Build a fixed set of inputs and run both the old flow and the new flow against it. Measure two things, not one: speed and accuracy. Speed is easy. Accuracy is the one people skip, and it is where "faster" quietly becomes "worse" if you are not looking. Define what a correct output is before you start, then score both flows the same way.

What you find. Done right, the rebuilt flow is not just faster. It is more accurate, and for a reason worth internalizing: a deterministic crawler returns the whole site, every time, while a model fetching pages one at a time stops whenever it decides it has "enough." Complete, structured input beats improvised input. But you only get to claim that if you measured it, and the eval is also where you meet the new failure modes: thin inputs, missing data, the long tail. Design those for graceful degradation: lower confidence, not a crash. The eval is how you prove the degradation is graceful.
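"Lower confidence, not a crash" can be as simple as scoring the inputs before the model sees them. The sketch below is one possible shape, not the one we ship: the required keys, the `min_pages` threshold, and the scoring rule (the fraction of required inputs present, halved when the crawl is thin) are all assumptions chosen for illustration.

```python
def assess(context: dict, required=("pages", "search", "identity"), min_pages=3) -> dict:
    """Score an assembled context instead of failing on gaps in it."""
    missing = [key for key in required if not context.get(key)]
    notes = [f"missing input: {key}" for key in missing]

    confidence = (len(required) - len(missing)) / len(required)

    pages = context.get("pages") or []
    if pages and len(pages) < min_pages:
        # Thin input: still usable, but the output should say so.
        confidence *= 0.5
        notes.append(f"thin crawl: only {len(pages)} page(s)")

    return {"confidence": confidence, "missing": missing, "notes": notes}
```

The point of returning notes alongside the score is that the final output can state what it could not see, and the eval can then check that low-confidence cases really are the ones with thin or missing inputs.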

Step 5: old vs new, to scale: sequential batches and the end-of-run write collapse into a short pre-flight plus a few parallel model calls.

What to do. Run the comparison, keep the fixtures, and re-run them on every change. "More accurate" is earned with evidence, never asserted. The eval set is also what lets you keep optimizing safely later. It is the difference between "I hope this still works" and "I know it does."
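A comparison harness does not need to be elaborate. The sketch below runs two flows over the same fixtures and reports speed and accuracy together. The names `evaluate` and `compare`, the fixture format of (input, expected) pairs, and the `is_correct` callback are assumptions; what counts as correct is something you have to define for your own agent.

```python
import time
from statistics import median


def evaluate(flow, fixtures, is_correct) -> dict:
    """Run one flow over fixed (input, expected) pairs and score it."""
    durations, correct = [], 0
    for given, expected in fixtures:
        start = time.perf_counter()
        try:
            output = flow(given)
        except Exception:
            output = None  # a crash counts as a wrong answer, not a skipped case
        durations.append(time.perf_counter() - start)
        if output is not None and is_correct(output, expected):
            correct += 1
    return {
        "median_seconds": median(durations),
        "accuracy": correct / len(fixtures),
    }


def compare(old_flow, new_flow, fixtures, is_correct) -> dict:
    """Score both flows the same way, on the same inputs."""
    old = evaluate(old_flow, fixtures, is_correct)
    new = evaluate(new_flow, fixtures, is_correct)
    speedup = (
        old["median_seconds"] / new["median_seconds"]
        if new["median_seconds"] > 0
        else float("inf")
    )
    return {
        "old": old,
        "new": new,
        "speedup": speedup,
        "accuracy_delta": new["accuracy"] - old["accuracy"],
    }
```

Keeping `fixtures` and `is_correct` in version control next to the agent is what makes the re-run on every change cheap: a negative `accuracy_delta` is the signal that a speedup was bought with correctness.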

Speed without an eval is a guess. Measure accuracy against a fixed set, or you are just shipping a faster way to be wrong.

The real question: do you even need an agent?

Run these five steps and a pattern emerges that is bigger than any single optimization. The agent was slow because it was asked to be the search engine, the scraper, and the database, asked to reason through work that was never reasoning work in the first place. Give those jobs back to the tools built for them and the model shrinks to its real role: the small, high-judgment synthesis at the end.

So before you reach for an agent at all, think hard about what the work actually is. A lot of what looks "agentic" is a fixed pipeline wearing a trench coat, a checklist that a few tool calls and one focused model call will do faster, cheaper, and more reliably than a model improvising through twenty turns. The biggest speedups in this whole method did not come from a faster model. They came from noticing that most of the work never needed one.

The fastest agent is the one that only does the part that needs a model. Everything else is a tool call you haven't written yet.

