Designing A Retrieval Stack For Agents
The main lesson is simple: retrieval for agents is not just “search, then answer.” It is a control system. The tools you expose, the names you give them, the shapes of their outputs, and the order they invite the model to use them will determine whether the agent behaves like a careful researcher or like a shallow autocomplete system.
The hardest-won insight is that agents do poorly when you give them one powerful search tool and hope prompting will teach restraint. If a tool can both retrieve and conveniently summarize, the agent will overuse it. It will skip orientation, skip disambiguation, skip completeness checks, and go straight to a plausible-looking answer. That is not primarily a prompting failure. It is usually a tool design failure.
A good retrieval stack separates phases of thinking. In practice, the stack should distinguish at least four functions:
orientation: what exists, where it lives, how the corpus is organized
discovery: which documents are relevant, recent, or likely authoritative
skimming: which pages, sections, rows, or chapters matter
extraction: what exact claim, number, or clause should be cited
Those phases should usually map to different tools, not different undocumented “modes” of one overloaded tool. Once multiple retrieval problems are collapsed into one tool, the agent loses the ability to reason clearly about which action it is taking. A retrieval tool should have one job the model can understand from the name alone.
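As a rough illustration of "one job per tool," the sketch below tags each tool with exactly one phase and checks that no phase is left without a tool. The names browse_collections and skim_document are hypothetical, chosen only for this example; nothing here reflects a specific product's API.

```python
from enum import Enum


class Phase(Enum):
    ORIENTATION = "orientation"  # what exists, how the corpus is organized
    DISCOVERY = "discovery"      # which documents are relevant or authoritative
    SKIMMING = "skimming"        # which pages, sections, or rows matter
    EXTRACTION = "extraction"    # the exact claim, number, or clause to cite


# Hypothetical tool names: each tool is registered under exactly one phase.
TOOL_PHASES = {
    "browse_collections": Phase.ORIENTATION,
    "discover_documents": Phase.DISCOVERY,
    "skim_document": Phase.SKIMMING,
    "read_document": Phase.EXTRACTION,
}


def uncovered_phases(tool_phases):
    """Return the phases that no registered tool serves."""
    return set(Phase) - set(tool_phases.values())


print(uncovered_phases(TOOL_PHASES))  # set()
```

Because the mapping is tool name to a single phase, a tool cannot quietly claim two jobs; a stack that exposes only a reader would show three uncovered phases.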
That leads to the first design principle: separate recall tools from precision tools. Orientation and discovery are recall problems. Page reads and exact fact extraction are precision problems. If recall and precision are mixed together, the model tends to over-optimize for speed and under-optimize for correctness. In enterprise data especially, that creates serious failure modes: the agent may retrieve an outdated contract version, a draft instead of an executed document, a summary spreadsheet instead of the primary source, or only the highest-ranked five entities when the user asked for all eleven.
A second principle is that folder structure is data. Many agent designers treat folders as storage metadata and text search as the real retrieval system. In enterprise corpora this is wrong. Folder trees often encode the authoritative inventory of entities: customers, deals, counterparties, projects, portfolio companies, fiscal periods, workstreams. If the user asks for “all,” “every,” “each,” or “compare across,” keyword-ranked document discovery is not enough. Relevance is not completeness. The agent needs an inventory step. That may be a folder browser, a collection enumerator, a metadata explorer, or a dataset indexer. But something in the stack must tell the model what exists before it begins ranking what matters.
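A minimal sketch of that inventory step, assuming entities map one-to-one to immediate subfolders and that ranked hits carry an "entity" field. Both assumptions, the function names, and the sample data are hypothetical.

```python
from pathlib import Path


def list_entities(root):
    """Inventory step: treat each immediate subfolder of root as one entity."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def missing_entities(inventory, ranked_hits):
    """Entities in the inventory that the ranked shortlist never touched."""
    seen = {hit["entity"] for hit in ranked_hits}
    return sorted(set(inventory) - seen)


# Hypothetical data: four customer folders, but ranking surfaced only two.
inventory = ["acme", "globex", "initech", "umbrella"]
ranked_hits = [
    {"entity": "acme", "score": 0.91},
    {"entity": "initech", "score": 0.84},
    {"entity": "acme", "score": 0.80},
]
print(missing_entities(inventory, ranked_hits))  # ['globex', 'umbrella']
```

The shortlist looks healthy on its own; only the comparison against the inventory shows that two entities were never examined.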
This is the deeper framing that works well across domains: recall before precision. That does not mean a rigid linear pipeline. It means the default posture of the agent should be: orient first, extract later. In practice the loop is dynamic:
recall: browse or discover what exists
assess: am I confident I have the right scope, versions, and entities?
precision: read or search within known documents
reassess: did I just discover I am missing entities, pages, or versions?
if yes, step back to recall
That loop matters because agents often need to move backward. They read a document and realize it is the wrong version. They search and discover thin results. They find a summary sheet and realize it hints at ten contracts they have not verified. A well-designed stack lets them step back without fighting the tools.
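The loop can be sketched roughly as follows. This is an illustrative control flow, not a prescribed implementation: the recall and precision callables, the toy corpus, and the max_rounds budget are all assumptions made for the example.

```python
def recall_precision_loop(recall, precision, max_rounds=5):
    """Alternate recall and precision until a read pass reports no gaps.

    recall(hints) returns a set of document IDs to add to scope.
    precision(scope) returns (evidence, gaps), where gaps are items the
    read pass found to be missing from scope.
    """
    scope, hints = set(), set()
    for _ in range(max_rounds):
        scope |= recall(hints)             # recall: browse or discover
        evidence, gaps = precision(scope)  # precision: read within known documents
        if not gaps:                       # reassess: nothing is missing
            return evidence, scope
        hints = gaps                       # step back to recall with new hints
    raise RuntimeError("scope still incomplete after max_rounds")


# Toy corpus (hypothetical): the summary sheet mentions two unverified contracts.
CORPUS = {
    "summary": {"mentions": {"contract_a", "contract_b"}},
    "contract_a": {"mentions": set()},
    "contract_b": {"mentions": set()},
}


def toy_recall(hints):
    return set(hints) or {"summary"}  # the first pass finds only the summary


def toy_precision(scope):
    mentioned = set().union(*(CORPUS[d]["mentions"] for d in scope))
    return sorted(scope), mentioned - scope


evidence, scope = recall_precision_loop(toy_recall, toy_precision)
print(evidence)  # ['contract_a', 'contract_b', 'summary']
```

The first round reads only the summary sheet, notices two contracts it has not verified, and goes back to recall before returning anything.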
This leads to another important principle: tool outputs must be designed for chaining. If one tool produces identifiers another tool needs, expose them explicitly. If one tool produces citable evidence, expose citation IDs explicitly. If one tool is only for navigation, make that unmistakable. The model should never have to infer how to move from one retrieval stage to the next. If the next step requires document_id, the previous tool should return document_id. If the final answer requires citation_id, only citable tools should expose citation_id. When that contract is muddy, the model invents bridges: constructing citation IDs from document UUIDs, citing summaries as if they were sources, or feeding the wrong identifier to the next tool.
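One way to make that contract explicit is to give each tool a typed output shape. In the sketch below, the discovery tool returns document_id but no citation_id, and only the read tool issues citation IDs. The class names, the "cit-0001" ID format, and the in-memory corpus are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class DocumentHit:
    """Navigation output: carries document_id for chaining, no citation_id."""
    document_id: str
    title: str
    summary: str


@dataclass(frozen=True)
class PageEvidence:
    """Evidence output: the only shape that carries a citation_id."""
    document_id: str
    page: int
    text: str
    citation_id: str


# Hypothetical in-memory corpus standing in for a real document store.
_CORPUS = {
    "doc-17": {
        "title": "Master Services Agreement",
        "summary": "Executed agreement covering services and termination.",
        "pages": {1: "Definitions.", 2: "Either party may terminate on 30 days notice."},
    }
}
_citation_counter = count(1)


def discover_documents(query):
    """Shortlist documents whose title or summary mentions the query."""
    q = query.lower()
    return [
        DocumentHit(doc_id, doc["title"], doc["summary"])
        for doc_id, doc in _CORPUS.items()
        if q in doc["title"].lower() or q in doc["summary"].lower()
    ]


def read_document(document_id, pages):
    """Return citable page text; citation IDs are issued here and nowhere else."""
    doc = _CORPUS[document_id]
    return [
        PageEvidence(document_id, p, doc["pages"][p], f"cit-{next(_citation_counter):04d}")
        for p in pages
    ]


hits = discover_documents("termination")
evidence = read_document(hits[0].document_id, pages=[2])
print(evidence[0].citation_id, evidence[0].text)
```

Since DocumentHit has no citation_id field at all, there is nothing for the model to cite from the shortlist, and the document_id it needs for the next call is handed over explicitly.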
That connects to source authority. A retrieval system should distinguish between at least three kinds of outputs:
navigation outputs: used to move through the space
evidence outputs: used to support claims
synthesized outputs: used to communicate conclusions
These should not be conflated. Navigation outputs like document shortlists or page summaries are valuable, but they are not always citable. Evidence outputs like full page text or extracted clauses are citable. Synthesized outputs may be useful to people, but if they are returned too early by a retrieval tool, they distort agent behavior. The agent starts optimizing for fluent answer generation instead of evidence gathering. A useful rule is: if a tool is not meant to support final claims directly, make it obviously non-citable.
Naming is more important than most engineers expect. Models reason from surface cues. A tool called file_explorer suggests generic browsing. A tool called discover_documents suggests version-aware document retrieval. A tool called read_document suggests page-level evidence access. Names should reflect the model’s mental action, not backend implementation details. Parameters matter too. detail="summary" is much clearer than view="context". pages=[3, 7] is clearer than overloaded reader modes. Tool names and parameter names are not decoration. They are part of the prompt.
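To make the parameter point concrete, here is a hedged sketch of a reader signature that uses detail and pages as described above. The page store, the "citable" flag in the return value, and the choice of "summary" as the default are assumptions for this example only.

```python
from typing import Literal, Optional, Sequence

# Hypothetical page store: each page has a short summary and its full text.
_PAGES = {
    3: {"summary": "Payment terms.", "full": "Invoices are due within 45 days."},
    7: {"summary": "Termination.", "full": "Either party may terminate on 30 days notice."},
}


def read_document(
    document_id: str,
    pages: Optional[Sequence[int]] = None,
    detail: Literal["summary", "full"] = "summary",
) -> dict:
    """Return the requested pages at the requested level of detail."""
    if detail not in ("summary", "full"):
        raise ValueError(f"detail must be 'summary' or 'full', got {detail!r}")
    selected = sorted(_PAGES) if pages is None else list(pages)
    return {
        "document_id": document_id,
        "detail": detail,
        "citable": detail == "full",  # only full text is marked as citable
        "pages": {p: _PAGES[p][detail] for p in selected},
    }


print(read_document("doc-17", pages=[3, 7], detail="summary")["pages"])
```

The call site reads almost like the instruction one would give a person: read pages 3 and 7 of this document, summaries only.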
Another hard lesson is that relevance ranking is not enough for coverage queries. A ranked shortlist is a great librarian when the question is “which documents are likely relevant?” It is not enough when the question is “which entities are in scope?” In one domain this may mean customer folders. In another it may mean database partitions, projects, issue queues, legal entities, repositories, or experiment runs. The retrieval stack needs a first-class mechanism for completeness, not just salience. Otherwise the agent will repeatedly answer “best matches” when the user asked for “all cases.”
There is also a strong lesson about tool output size. If the agent consumes raw outputs directly, the outputs must be shaped for the agent, not for a backend model with a giant context window. A stack designed around sending hundreds of thousands of characters to a synthesis model will fail once the agent itself becomes the reader. At that point, content budgets, truncation warnings, preview fields, and summary fields all need to be redesigned around the agent’s real observation limits. Output size is not a logging concern. It is part of tool semantics.
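A small sketch of what treating output size as part of tool semantics might look like: the tool enforces a character budget and says so explicitly when it truncates. The 2,000-character default, the field names, and the warning wording are illustrative assumptions.

```python
def shape_output(text, budget_chars=2000):
    """Fit tool output to the agent's observation budget and flag any truncation."""
    total = len(text)
    if total <= budget_chars:
        return {"content": text, "truncated": False, "total_chars": total, "warning": None}
    return {
        "content": text[:budget_chars],
        "truncated": True,
        "total_chars": total,
        "warning": (
            f"Showing the first {budget_chars} of {total} characters. "
            "Request specific pages to read the rest."
        ),
    }


result = shape_output("x" * 5000)
print(result["truncated"], len(result["content"]), result["total_chars"])  # True 2000 5000
```

The explicit warning matters as much as the cut itself: without it, the agent has no way to tell a short page from a clipped one.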
A related lesson is that summary fields are often more useful than raw content previews, but only when their role is clear. In several systems, the best signal for choosing where to read next is not the first 2,000 characters of a page; it is a page-level summary, chunk context, row description, or section heading. But if the model is allowed to answer directly from those summaries, it will. So the stack should preserve the distinction:
summaries help choose
full text supports claims
Once that distinction exists, the prompt can teach skimming as a normal bridge step rather than a special case.
One of the strongest practical lessons is that strong models and weak models do not need the same amount of runtime guidance. A good prompt may be enough for strong models. Weak models may still take the shortest path every time. In those cases, a runtime nudge can help, but only if it is narrow and well-placed. Broad intervention logic is usually brittle. A lightweight step-level reminder that only fires on the first step, only for coverage-style queries, and only when the model skips inventory can be effective. But such nudges should be treated as safety nets, not the primary design. If the retrieval stack only works because of reactive runtime patches, the core tool design is probably still wrong.
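A narrow nudge of that kind might look like the sketch below. It fires only when all three conditions hold: first step, coverage-style query, and a first tool call that is not an inventory tool. The cue words are taken from the coverage terms mentioned earlier; the tool name browse_collections and the reminder text are hypothetical.

```python
import re

# Coverage cues; the word boundaries keep "overall" from matching "all".
COVERAGE_CUES = re.compile(r"\b(all|every|each|compare across)\b", re.IGNORECASE)
INVENTORY_TOOLS = {"browse_collections"}  # hypothetical inventory tool name


def inventory_nudge(query, step_index, tool_name):
    """Return a reminder string, or None when the nudge should stay silent."""
    if step_index != 0:
        return None  # only on the first step
    if not COVERAGE_CUES.search(query):
        return None  # only for coverage-style queries
    if tool_name in INVENTORY_TOOLS:
        return None  # only when the model skipped inventory
    return (
        "This question asks for complete coverage. "
        "Consider listing what exists before searching."
    )


print(inventory_nudge("Summarize all customer contracts", 0, "discover_documents"))
```

Keeping every condition as an early return makes it easy to see, and to audit, exactly when the nudge can fire.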
Even when runtime nudges exist, validation is more powerful than advice. If you need table rows to be cited individually, check that. If you need claims to come from citable tools rather than summaries, check that. If you need “all” queries to reflect an inventory rather than a ranked shortlist, create a notion of completeness failure. Models are often willing to ignore good advice. They are much less able to ignore hard feedback loops that reject incomplete or weakly supported answers.
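A validator along these lines could be sketched as follows. It assumes the runtime records which citation IDs were issued by citable tools and, for coverage queries, which entities the inventory listed. The claim structure and the error messages are hypothetical.

```python
def validate_answer(claims, citable_ids, inventory=None, covered=None):
    """Return a list of rejection reasons; an empty list means the answer passes."""
    errors = []
    for claim in claims:
        cids = claim.get("citation_ids", [])
        if not cids:
            errors.append(f"uncited claim: {claim['text']!r}")
        for cid in cids:
            if cid not in citable_ids:
                errors.append(f"citation {cid!r} was not issued by a citable tool")
    if inventory is not None:
        # Completeness failure: entities that exist but were never covered.
        missing = sorted(set(inventory) - set(covered or ()))
        if missing:
            errors.append("completeness failure, missing: " + ", ".join(missing))
    return errors


claims = [
    {"text": "Acme may terminate on 30 days notice.", "citation_ids": ["cit-0001"]},
    {"text": "Globex has the same clause.", "citation_ids": ["summary-3"]},
]
problems = validate_answer(
    claims,
    citable_ids={"cit-0001"},
    inventory=["acme", "globex", "initech"],
    covered=["acme", "globex"],
)
for p in problems:
    print(p)
```

Returning reasons rather than a bare boolean lets the runtime feed the rejection back to the model as a concrete instruction for the next attempt.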
This suggests a general hierarchy for agent retrieval design:
1. fix tool boundaries
2. fix output contracts
3. fix names and parameters
4. teach the retrieval strategy in the prompt
5. add runtime nudges for weak models
6. enforce behavior with validators
That order matters. Teams often start with the fourth item, prompting, and over-prompt around bad tools. That works temporarily and then turns into prompt bloat. If the tool stack is clean, the prompt can stay strategic and short.
A practical retrieval stack for agents often ends up looking like this:
an orientation tool: browse collections, folders, scopes, or entities
a discovery tool: shortlist documents with titles, paths, dates, summaries, and IDs
a skim tool: expose section/page/row-level summaries
a read tool: expose exact citable text or structured content
a targeted search tool: search semantically inside already known documents
optional calculators or transformers: turn evidence into derived values, and register those derived values as citable artifacts
This is not specific to one API or one product. It generalizes well to document systems, code search, data warehouses, emails, ticket systems, knowledge bases, and multimodal corpora. The exact tools vary, but the roles do not.
If there is one final principle to keep, it is this: agents should not begin by extracting facts. They should begin by learning what space they are operating in. Once they know the space, precision becomes much easier. Without that orientation step, every search is a guess disguised as retrieval.



