Most developers learn RAG as a pipeline: embed your documents, store them, search on every query, inject results into the prompt. That approach works — but it wastes tokens on every single message, even ones that don't need any retrieval at all. Agentic RAG fixes this by giving your agent the judgment to decide when searching is actually worth it.
The Problem with Always Searching
Consider a customer support agent with a static RAG pipeline. Every time a user sends a message — "Hi", "Thanks", "Sounds good" — the pipeline dutifully embeds the text, queries the vector index, retrieves five documents, and stuffs them into the prompt before calling the LLM. For a greeting, you've just spent 600–800 tokens on results that contribute nothing to the answer.
Across thousands of conversations, this adds up to real cost and real latency. More importantly, injecting irrelevant context can actually degrade response quality — the model has to sift through noise to find its answer.
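To put a rough number on that overhead, here is a back-of-the-envelope sketch. The 700-token default is simply the midpoint of the 600–800 range above. The message volume and the share of messages that need no retrieval are hypothetical inputs, not measurements, and the function name is made up for illustration.

```python
def wasted_retrieval_tokens(
    messages: int,
    no_retrieval_share: float,
    tokens_per_retrieval: int = 700,
) -> int:
    """Estimate tokens spent on retrieval for messages that never needed it."""
    if not 0.0 <= no_retrieval_share <= 1.0:
        raise ValueError("no_retrieval_share must be between 0 and 1")
    return round(messages * no_retrieval_share * tokens_per_retrieval)


# Hypothetical workload: 10,000 messages, 30% of them greetings or acknowledgements.
print(wasted_retrieval_tokens(10_000, 0.30))
```

Swap in your own traffic numbers; the point is only that the waste scales linearly with the share of messages that never needed a search.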
What Agentic RAG Looks Like Instead
In agentic RAG, search is a tool — something the agent can call when it decides retrieval is warranted. The system message describes when to use it. The agent reads the user's message, determines whether the answer requires retrieved context, and only then invokes the search.
The configuration is mostly in your system instruction:
system_message = """You are a customer support assistant for Contoso.
You have access to a search_knowledge_base tool.
Use it ONLY when the user asks about:
- specific products, pricing, or availability
- return or refund policies
- their order status (requires order ID)
For greetings, general questions, or anything you can answer from
general knowledge, respond directly without searching."""
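The system message only tells the model when to search; the tool itself still has to be declared and dispatched. The sketch below shows one way to wire that up with the Chat Completions function-calling format. The names SEARCH_TOOL, run_agent_turn, tool_impls, and max_rounds are hypothetical, and the loop is a minimal illustration under those assumptions rather than a production agent.

```python
import json

# Hypothetical tool declaration in the Chat Completions "tools" format.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search Contoso product, pricing, and policy documents.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "What to look up."}
            },
            "required": ["query"],
        },
    },
}


def run_agent_turn(client, model, messages, tool_impls, max_rounds=3):
    """Let the model answer directly or call tools first; return the final text.

    `client` is any object exposing `chat.completions.create(...)`, such as an
    `openai.AzureOpenAI` instance. `tool_impls` maps tool names to callables.
    """
    messages = list(messages)  # work on a copy so the caller's list is untouched
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=[SEARCH_TOOL]
        )
        reply = response.choices[0].message
        if not reply.tool_calls:
            return reply.content  # the model decided no search was needed

        # Echo the assistant's tool request, then append each tool result.
        messages.append({
            "role": "assistant",
            "content": reply.content,
            "tool_calls": [
                {
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }
                for call in reply.tool_calls
            ],
        })
        for call in reply.tool_calls:
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
    # Fallback if the model keeps requesting tools past max_rounds.
    return "Sorry, I could not complete that request."
```

In use, the first entry in messages would be the system message shown above, and tool_impls would map "search_knowledge_base" to your retrieval function. A greeting then costs one model call and no search, while a refund question triggers the tool and a second model call.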
Embeddings: The Mechanism Behind Semantic Search
Before writing the full pipeline, it's worth understanding what embeddings actually are — because this is where the "semantic" in semantic search comes from.
An embedding model takes a piece of text and outputs a list of floating-point numbers — typically 1,536 values for text-embedding-3-small. These numbers represent the meaning of the text in a high-dimensional space. Two texts with similar meaning produce vectors that are mathematically close to each other, even if they share no words in common. That's how "How do I get my money back?" matches "Return and Refund Policy" — the concepts are semantically similar even though the vocabulary doesn't overlap.
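"Mathematically close" is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-dimensional vectors that stand in for real 1,536-dimensional embeddings; the values are invented purely to illustrate the comparison and do not come from any model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented toy vectors, not real model output.
refund_question = [0.9, 0.1, 0.2]   # "How do I get my money back?"
refund_policy = [0.8, 0.2, 0.1]     # "Return and Refund Policy"
shipping_times = [0.1, 0.9, 0.3]    # an unrelated document

close = cosine_similarity(refund_question, refund_policy)
far = cosine_similarity(refund_question, shipping_times)
print(f"refund policy: {close:.2f}, shipping times: {far:.2f}")
```

The refund pair points in nearly the same direction and scores close to 1, while the unrelated document scores much lower. A vector index performs this kind of comparison at scale.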
| Model | Dimensions | Cost | Best for |
|---|---|---|---|
| text-embedding-3-small | 1,536 | Lower | Monolingual workloads, cost-sensitive high-volume search |
| text-embedding-3-large | 3,072 | Several times higher (check current pricing) | Multilingual support (100+ languages), higher precision tasks |
Azure AI Search: What You Need to Know Before Starting
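- Vector search is available on every pricing tier, including Free and Basic. Assuming it requires Standard is a common setup mistake; what the tier actually changes is storage and vector index quota, so size it to your corpus.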
- You need three RBAC roles assigned to your identity or managed identity: Search Service Contributor, Search Index Data Contributor, and Search Index Data Reader.
- Azure AI Search Standard tier is billed hourly, not per query. Leave it running overnight by accident and you'll notice it on your bill.
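- Results include a relevance score, @search.score, but its scale depends on the query type. Keyword (BM25) scores have no upper bound, cosine vector scores fall roughly between 0.33 and 1, and hybrid queries fused with Reciprocal Rank Fusion usually score far lower. A cutoff such as 0.7 makes sense for pure vector queries; for other query types, calibrate the threshold against your own index before passing results to the LLM. Low-confidence results add noise without adding value.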
The 5-Step RAG Pattern in Code
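The retrieval tool follows the same five steps in most implementations, and the example below walks through them. To keep it short, answer_with_rag calls search_knowledge_base directly on every question; in an agentic setup you would register search_knowledge_base as a tool and let the model decide when to call it: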
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

openai_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01",
)

search_client = SearchClient(
    endpoint=os.getenv("SEARCH_ENDPOINT"),
    index_name="product-knowledge",
    credential=AzureKeyCredential(os.getenv("SEARCH_KEY")),
)

def search_knowledge_base(query: str, top_k: int = 3, min_score: float = 0.7) -> str:
    # Step 1: Embed the query using the SAME model as your documents.
    # With AzureOpenAI, `model` is the name of your embedding deployment.
    embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    ).data[0].embedding

    # Step 2: Run a hybrid search (keyword + vector in a single request)
    vector_q = VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=top_k,
        fields="content_vector",
    )
    results = search_client.search(
        search_text=query,          # Keyword component
        vector_queries=[vector_q],  # Semantic component
        select=["title", "content"],
        top=top_k,
    )

    # Step 3: Filter by relevance and discard low-confidence results.
    # The scale of @search.score depends on the query type; hybrid scores are
    # usually far below 0.7, so calibrate min_score against your own index.
    chunks = [
        r["content"] for r in results
        if r.get("@search.score", 0) >= min_score
    ]

    # Step 4: Build the grounding context string
    return "\n\n".join(chunks) if chunks else "No relevant information found."

def answer_with_rag(user_question: str) -> str:
    grounding = search_knowledge_base(user_question)

    # Step 5: Place the grounding context BEFORE the user question, so the
    # model reads the retrieved facts first and answers from that evidence
    # rather than from its training data alone.
    prompt = (
        "Use the following information to answer the question:\n\n"
        f"{grounding}\n\n"
        f"Question: {user_question}"
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o",  # With AzureOpenAI, this is your chat deployment name
        messages=[
            {"role": "system", "content": "You are a helpful customer support assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
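
Step 2 issues a hybrid query, and Azure AI Search documents that it merges the keyword and vector rankings with Reciprocal Rank Fusion (RRF). The sketch below illustrates the idea under stated assumptions: the constant k = 60 is the commonly cited value, the document IDs are hypothetical, and the function is an illustration of the technique, not the service's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the keyword and vector halves of one query.
keyword_hits = ["returns-policy", "shipping-faq", "warranty"]
vector_hits = ["returns-policy", "refund-howto", "shipping-faq"]

for doc_id, score in reciprocal_rank_fusion([keyword_hits, vector_hits]):
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists rise to the top. Note how small the fused scores are: with k = 60 and two lists, even the best possible score is 2/61, which is why a relevance cutoff tuned for cosine similarity does not transfer to hybrid results.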
Choosing Your Grounding Source
| Source | Use when | Key note |
|---|---|---|
| Azure AI Search | Large static document collections, complex relevance tuning | Hourly billing; requires CDC setup for live data |
| Cosmos DB (vector) | Data changes frequently (>10x/day), need real-time accuracy | Vectors live in the documents, so the vector index stays current as documents are written |
| Microsoft Fabric / OneLake | Live enterprise data across systems | AI Search indexes OneLake directly — no data copy, built-in CDC |
| Bing Search | Post-training events, live internet data, current news | Per-query cost; rate limits apply |
For Cosmos DB feeding AI Search: Cosmos DB keeps its own vector index current as documents are written, but the AI Search index doesn't know about those changes unless you wire up an Azure Function triggered by the Cosmos DB Change Feed. Fabric is simpler: AI Search indexes OneLake directly, so changes in OneLake are reflected in AI Search automatically.
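The core of such a Change Feed handler can be sketched as a plain function. Everything here is an assumption for illustration: the function name, the "id" key field, and the idea of passing the embedding and upload steps in as callables. The title, content, and content_vector fields mirror the index used in the search example. In a real Azure Function, changed_docs would come from the Cosmos DB trigger and upload could be SearchClient.merge_or_upload_documents.

```python
from typing import Callable, Iterable


def sync_changes_to_index(
    changed_docs: Iterable[dict],
    embed: Callable[[str], list[float]],
    upload: Callable[[list[dict]], object],
) -> int:
    """Re-embed changed documents and push them to the search index.

    Returns the number of documents sent to `upload`.
    """
    batch = []
    for doc in changed_docs:
        batch.append({
            "id": doc["id"],                          # assumed index key field
            "title": doc.get("title", ""),
            "content": doc["content"],
            # Recompute the vector so it matches the updated text.
            "content_vector": embed(doc["content"]),
        })
    if batch:
        upload(batch)  # skip the network call when nothing changed
    return len(batch)
```

Keeping the embedding and upload steps injectable makes the handler easy to exercise locally with stand-ins before connecting it to live services.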