Agentic RAG: Teach Your Agent When to Search

11 min read  ·  MindTechLabs

Most developers learn RAG as a pipeline: embed your documents, store them, search on every query, inject results into the prompt. That approach works — but it wastes tokens on every single message, even ones that don't need any retrieval at all. Agentic RAG fixes this by giving your agent the judgment to decide when searching is actually worth it.

The Problem with Always Searching

Consider a customer support agent with a static RAG pipeline. Every time a user sends a message — "Hi", "Thanks", "Sounds good" — the pipeline dutifully embeds the text, queries the vector index, retrieves five documents, and stuffs them into the prompt before calling the LLM. For a greeting, you've just spent 600–800 tokens on results that contribute nothing to the answer.

Across thousands of conversations, this adds up to real cost and real latency. More importantly, injecting irrelevant context can actually degrade response quality — the model has to sift through noise to find its answer.
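
To put a rough number on that, here is a back-of-the-envelope sketch. The traffic figures are hypothetical, chosen only to illustrate the arithmetic; substitute your own message volume and the share of messages that need no retrieval.

```python
def wasted_retrieval_tokens(messages_per_day, share_without_retrieval_need, tokens_per_retrieval):
    """Tokens spent per day injecting context into messages that didn't need it."""
    return round(messages_per_day * share_without_retrieval_need * tokens_per_retrieval)


# Hypothetical traffic: 10,000 messages/day, 30% greetings or acknowledgements,
# 700 tokens of injected context each (midpoint of the 600-800 range above).
print(wasted_retrieval_tokens(10_000, 0.30, 700))  # 2100000
```

Under those assumptions, roughly 2.1 million prompt tokens a day go to context the model never needed.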

Think about it this way: A good librarian doesn't run to the stacks for every question a patron asks. They listen first, decide whether a book is actually needed, then go get it. Static RAG is the librarian who sprints to the shelves the moment anyone enters the building.

What Agentic RAG Looks Like Instead

In agentic RAG, search is a tool — something the agent can call when it decides retrieval is warranted. The system message describes when to use it. The agent reads the user's message, determines whether the answer requires retrieved context, and only then invokes the search.

The configuration is mostly in your system instruction:

system_message = """You are a customer support assistant for Contoso.
You have access to a search_knowledge_base tool.
Use it ONLY when the user asks about:
  - specific products, pricing, or availability
  - return or refund policies
  - their order status (requires order ID)
For greetings, general questions, or anything you can answer from
general knowledge, respond directly without searching."""
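
A minimal sketch of how that instruction could be paired with a tool definition, assuming the OpenAI-style chat completions tool-calling format. The helper name run_agent_turn and its search_fn parameter are hypothetical; client is assumed to be an AzureOpenAI instance, and "gpt-4o" is assumed to be the name of your chat deployment.

```python
import json

# JSON schema describing the tool to the model (OpenAI-style function calling).
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search Contoso product, pricing, and policy documents.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text."}
            },
            "required": ["query"],
        },
    },
}


def run_agent_turn(client, system_message, user_message, search_fn, model="gpt-4o"):
    """Hypothetical helper: let the model decide whether to search, then answer."""
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]
    first = client.chat.completions.create(
        model=model, messages=messages, tools=[SEARCH_TOOL], tool_choice="auto"
    )
    reply = first.choices[0].message

    # No tool call: the model judged retrieval unnecessary (e.g. a greeting).
    if not reply.tool_calls:
        return reply.content

    # Tool call(s): run the search and hand the results back to the model.
    messages.append(reply)
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_fn(args["query"]),
        })
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content
```

With a setup along these lines, a message like "Hi" should come back after a single model call with no retrieval, while a refund question should trigger the search before the final answer is written.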

Embeddings: The Mechanism Behind Semantic Search

Before writing the full pipeline, it's worth understanding what embeddings actually are — because this is where the "semantic" in semantic search comes from.

An embedding model takes a piece of text and outputs a list of floating-point numbers: 1,536 values by default for text-embedding-3-small. These numbers represent the meaning of the text in a high-dimensional space. Two texts with similar meaning produce vectors that are mathematically close to each other, even if they share no words in common. That's how "How do I get my money back?" matches "Return and Refund Policy": the concepts are semantically similar even though the vocabulary doesn't overlap.
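
To make "mathematically close" concrete, here is a small sketch of cosine similarity, a measure commonly used to compare embedding vectors. The three-dimensional vectors are made-up toy values, not real model output; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (illustrative values only).
refund_question = [0.9, 0.1, 0.2]
refund_policy = [0.8, 0.2, 0.3]
shipping_times = [0.1, 0.9, 0.1]

print(cosine_similarity(refund_question, refund_policy))   # close to 1
print(cosine_similarity(refund_question, shipping_times))  # much lower
```

The length check also hints at why models can't be mixed: vectors from models with different dimensions can't be compared at all, and vectors from different models with the same dimensions can be compared but the result means nothing.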

The golden rule of embeddings: Always use the same model to embed your documents at index time and to embed queries at search time. Mix models and you're comparing measurements in different units — the distances become meaningless and your search quality collapses.
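
| Model | Dimensions | Cost | Best for |
| --- | --- | --- | --- |
| text-embedding-3-small | 1,536 | Lower | Monolingual workloads, cost-sensitive high-volume search |
| text-embedding-3-large | 3,072 | Higher (about 6.5× at published list prices; check current pricing) | Multilingual support (100+ languages), higher precision tasks |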

Azure AI Search: What You Need to Know Before Starting

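  • Vector search is available on every pricing tier, including Free and Basic. The tier determines storage and vector index quotas rather than whether the feature exists, so check the quota for your tier before indexing a large corpus.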
  • You need three RBAC roles assigned to your identity or managed identity: Search Service Contributor, Search Index Data Contributor, and Search Index Data Reader.
  • Azure AI Search Standard tier is billed hourly, not per query. Leave it running overnight by accident and you'll notice it on your bill.
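  • Results include a relevance score, @search.score, whose range depends on the query type: roughly 0.333 to 1.0 for pure vector queries using cosine similarity, unbounded for BM25 keyword queries, and only a few hundredths for hybrid queries. A cutoff such as 0.7 is therefore meaningful only for pure vector queries. Calibrate a threshold for the query type you run and filter out anything below it before passing results to the LLM, because low-confidence results add noise without adding value.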
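
One caveat before hard-coding a cutoff: hybrid queries are merged with Reciprocal Rank Fusion (RRF), which scores documents by their rank in each result list rather than by similarity. The sketch below uses the textbook RRF formula with the conventional constant k = 60 to show the scale of the fused scores. It illustrates the technique and is not Azure AI Search's internal implementation; the document IDs are made up.

```python
def rrf_scores(rankings, k=60):
    """Fuse ranked lists of doc IDs: each list adds 1 / (k + rank), rank starting at 1."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))


keyword_ranking = ["returns-policy", "shipping-faq", "warranty"]
vector_ranking = ["returns-policy", "warranty", "gift-cards"]

for doc_id, score in rrf_scores([keyword_ranking, vector_ranking]).items():
    print(f"{doc_id}: {score:.4f}")
# returns-policy: 0.0328
# warranty: 0.0320
# shipping-faq: 0.0161
# gift-cards: 0.0159
```

Even a document ranked first by both queries scores only about 0.033 here, so a 0.7 cutoff would discard every hybrid result. Any fixed threshold needs to match the scoring scale of the query type in use.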

The 5-Step RAG Pattern in Code

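Whether the search runs on every message or only when the agent calls it, the retrieval itself follows the same five steps. Here's an end-to-end example. Note that answer_with_rag searches unconditionally to keep the flow easy to follow; in an agentic setup, you would register search_knowledge_base as a tool and let the model decide when to call it: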

from openai import AzureOpenAI
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential
import os

openai_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01"
)

search_client = SearchClient(
    endpoint=os.getenv("SEARCH_ENDPOINT"),
    index_name="product-knowledge",
    credential=AzureKeyCredential(os.getenv("SEARCH_KEY"))
)

def search_knowledge_base(query: str, top_k: int = 3) -> str:
    # Step 1: Embed the query using the SAME model as your documents.
    # On Azure OpenAI, `model` is the name of your embedding deployment.
    embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    # Step 2: Run a hybrid search — keyword + vector simultaneously
    vector_q = VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=top_k,
        fields="content_vector"
    )
    results = search_client.search(
        search_text=query,          # Keyword component
        vector_queries=[vector_q],  # Semantic component
        select=["title", "content"],
        top=top_k
    )

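    # Step 3: Filter by relevance and discard low-confidence results.
    # Hybrid queries are ranked with Reciprocal Rank Fusion, so @search.score
    # is typically a few hundredths; a 0.7 cutoff would discard every result.
    # 0.02 is an illustrative value: calibrate it against your own index.
    min_score = 0.02
    chunks = [
        r["content"] for r in results
        if r.get("@search.score", 0) >= min_score
    ]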

    # Step 4: Build the grounding context string
    return "\n\n".join(chunks) if chunks else "No relevant information found."

def answer_with_rag(user_question: str) -> str:
    grounding = search_knowledge_base(user_question)

    # Step 5: CRITICAL — grounding context goes BEFORE the user question
    # The LLM processes the prompt top-to-bottom.
    # Facts presented first = model reasons from evidence, not training data.
    prompt = (
        "Use the following information to answer the question:\n\n"
        f"{grounding}\n\n"
        f"Question: {user_question}"
    )

    response = openai_client.chat.completions.create(
        model="gpt-4o",  # on Azure OpenAI, this is the name of your chat deployment
        messages=[
            {"role": "system", "content": "You are a helpful customer support assistant."},
            {"role": "user",   "content": prompt}
        ]
    )
    return response.choices[0].message.content

Why the order matters: Placing grounding results before the user question isn't just stylistic. The model reads the prompt in order, so when the retrieved context comes first, the question is interpreted with that evidence already in view, and the model is less likely to fall back on what it remembers from training. Treat this ordering as a sensible default and confirm the effect on your own evaluation set.

Choosing Your Grounding Source

| Source | Use when | Key note |
| --- | --- | --- |
| Azure AI Search | Large static document collections, complex relevance tuning | Hourly billing; requires CDC setup for live data |
| Cosmos DB (vector) | Data changes frequently (>10x/day), need real-time accuracy | Vectors update automatically with document changes |
| Microsoft Fabric / OneLake | Live enterprise data across systems | AI Search indexes OneLake directly; no data copy, built-in CDC |
| Bing Search | Post-training events, live internet data, current news | Per-query cost; rate limits apply |

For Cosmos DB feeding AI Search: vectors in Cosmos update automatically, but the AI Search index doesn't know about those changes unless you wire up an Azure Function triggered by the Cosmos DB Change Feed. Fabric is simpler — changes in OneLake are reflected in AI Search automatically.
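
The body of such a function could look roughly like the sketch below. The Change Feed trigger binding is omitted, and several details are assumptions to adjust for your own schema: the helper name sync_changes_to_index, the "id" key field, and the embed_fn callable, which is expected to return an embedding from the same model used to build the index. The search_client argument is assumed to be an azure.search.documents SearchClient.

```python
def sync_changes_to_index(changed_docs, search_client, embed_fn):
    """Push a batch of changed Cosmos DB documents into the search index.

    Assumptions (hypothetical): each document has "id", "title", and
    "content"; the index key field is "id"; the vector field is
    "content_vector", as in the search example earlier.
    """
    batch = [
        {
            "id": doc["id"],
            "title": doc["title"],
            "content": doc["content"],
            "content_vector": embed_fn(doc["content"]),
        }
        for doc in changed_docs
    ]
    if batch:
        # merge_or_upload_documents updates existing keys and inserts new ones.
        search_client.merge_or_upload_documents(documents=batch)
    return len(batch)
```

Deletions are not covered here; a production version would also need to remove index entries for documents deleted in Cosmos DB.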
