One of the most expensive mistakes you can make early in Azure AI development is treating all models the same. Route everything through GPT-5 and you're spending 50x what you need to. Use a small model for everything and complex reasoning falls apart. The right architecture almost always involves both — used strategically.
What "Large" and "Small" Actually Mean
The terms LLM and SLM come down to parameter count — the number of internal weights a model learned during training. More parameters generally means more reasoning capacity, but also more compute required per inference, which translates directly to latency and cost.
| Model | Type | Parameters (approx) | Latency | Best suited for |
|---|---|---|---|---|
| GPT-5 / GPT-4o | LLM | Not publicly disclosed (often estimated near 1 trillion) | 500–2,000ms | Complex reasoning, orchestration, planning |
| Phi-4 Medium | SLM boundary | ~18 billion | 200–400ms | Balanced capability and speed |
| Phi-4 Small | SLM | ~7 billion | 100–200ms | Summarisation, extraction, classification |
| Phi-4 Mini | SLM | ~3.8 billion | 50–150ms | Ultra-fast routing, simple yes/no decisions |
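To make the link between parameter count and compute a little more concrete, here is a rough sketch that estimates the memory needed just to hold a model's weights. It assumes 16-bit weights (2 bytes per parameter) and reuses the approximate parameter counts from the table; the function name `weight_memory_gb` is illustrative, and a real deployment also needs memory for activations and caches.

```python
def weight_memory_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


# Approximate parameter counts taken from the table above.
models = {
    "Phi-4 Mini": 3.8e9,
    "Phi-4 Small": 7e9,
    "Phi-4 Medium": 18e9,
}

for name, params in models.items():
    # Assumption: 16-bit precision, i.e. 2 bytes per parameter.
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights at 16-bit precision")
```

Every one of those bytes has to be read for each generated token, which is one reason smaller models tend to respond faster.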
The Latency Problem Nobody Expects
Here's the calculation that surprises most developers. Say you build a multi-step agent that makes 10 sequential LLM calls: routing decisions, tool interpretations, response formatting. At the upper end of GPT-4o latency, around 2 seconds per call, that's 20 seconds of user-perceived wait time. Most users assume something is broken before the 15-second mark.
Replace those 10 calls with Phi-4 Mini calls at 150ms each and you're at 1.5 seconds total. Same task, 13x faster, at a fraction of the cost. This is why sub-agent calls in multi-agent architectures should default to SLMs unless the task genuinely requires GPT-level reasoning.
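The arithmetic is simple enough to keep in a helper while you design an agent. This is a minimal sketch using the figures above (10 sequential calls, 2,000 ms versus 150 ms per call); the function name is illustrative, and it ignores network overhead and any calls you could run in parallel.

```python
def sequential_latency_s(calls: int, per_call_ms: float) -> float:
    """Total wait for calls made one after another, in seconds."""
    return calls * per_call_ms / 1000


llm_total = sequential_latency_s(10, 2000)  # ten calls at 2,000 ms each
slm_total = sequential_latency_s(10, 150)   # ten calls at 150 ms each
speedup = llm_total / slm_total

print(f"LLM: {llm_total:.1f}s, SLM: {slm_total:.1f}s, speedup: {speedup:.1f}x")
```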
The Hybrid Routing Pattern
The architecture that works in production: use an SLM to classify every incoming request, then route to the appropriate model. The SLM handles the decision cheaply and fast; the LLM only sees the work it's actually equipped for.
import os
import requests

SLM = os.getenv("PHI4_MINI_ENDPOINT")
LLM = os.getenv("GPT5_ENDPOINT")
HDRS = {"api-key": os.getenv("FOUNDRY_API_KEY"), "Content-Type": "application/json"}


def classify(message: str) -> str:
    """SLM decides: 'simple' or 'complex'. Runs on every single request."""
    resp = requests.post(SLM, headers=HDRS, timeout=30, json={
        "messages": [
            {"role": "system", "content":
                "Reply with exactly one word: 'simple' or 'complex'. "
                "Simple = greetings, FAQs, status checks. "
                "Complex = multi-step tasks, planning, calculations."},
            {"role": "user", "content": message}
        ],
        "max_tokens": 5,
        "temperature": 0.1
    })
    resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    label = resp.json()["choices"][0]["message"]["content"].strip().lower()
    # Tolerate stray punctuation or quotes; anything unexpected is treated as complex.
    return "simple" if label.strip(".'\"") == "simple" else "complex"


def respond(message: str) -> str:
    """Route the message to the SLM or the LLM based on the classifier's verdict."""
    if classify(message) == "simple":
        endpoint, system = SLM, "You are a helpful assistant. Answer concisely."
    else:
        endpoint, system = LLM, "Think step by step. Use tools as needed."
    resp = requests.post(endpoint, headers=HDRS, timeout=60, json={
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": message}
        ]
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
In well-tuned systems, 85–90% of traffic routes to the SLM. Only genuinely complex requests reach the expensive model. The cost difference at production scale is not marginal: depending on the price gap between the two models, it can approach an order of magnitude.
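To sanity-check the savings for your own traffic mix, a back-of-the-envelope model helps. The sketch below works in relative cost units and rests on assumptions that are not real prices: an LLM call costs 50 units and an SLM call costs 1 (the 50x gap mentioned at the start), and every request pays for one SLM classification call before it is answered. The function name `blended_cost` is illustrative.

```python
def blended_cost(slm_share: float, llm_cost: float = 50.0, slm_cost: float = 1.0) -> float:
    """Average cost per request under hybrid routing, in relative units.

    Every request pays for one SLM classification call, then for one
    answering call on whichever model it was routed to.
    """
    answer_cost = slm_share * slm_cost + (1 - slm_share) * llm_cost
    return slm_cost + answer_cost


all_llm = 50.0  # cost per request if everything goes to the LLM (assumed)
for share in (0.85, 0.90):
    hybrid = blended_cost(share)
    print(f"{share:.0%} to SLM: {hybrid:.2f} units/request, {all_llm / hybrid:.1f}x cheaper")
```

Under these assumed numbers the saving works out to roughly 5x to 7x. A wider price gap or a higher SLM share pushes it further, so plug in your own per-token prices before quoting a figure.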
Parameters That Matter: Temperature and top_p
Temperature controls how deterministic vs creative the output is. At 0, the model picks the highest-probability next token every time — you get consistent, repeatable results. At 1+, it samples more broadly — outputs are varied, sometimes surprisingly good, sometimes inconsistent.
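top_p (nucleus sampling) controls the diversity of the token pool the model samples from. At 0.1, it only considers the smallest set of most-likely tokens whose combined probability reaches 10%, which is often just one or two tokens. At 1.0, the full distribution is in play.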
# Classification: you want the same answer every time
classification_params = {
    "temperature": 0.1,  # Near-deterministic
    "top_p": 0.9,
    "max_tokens": 10     # Strict output cap for routing decisions
}

# Creative writing: you want interesting variation
creative_params = {
    "temperature": 1.2,  # More creative
    "top_p": 0.95,
    "max_tokens": 500
}
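If it helps to see what these two knobs do to the candidate tokens, here is a toy sketch of temperature scaling followed by top_p filtering. It follows the general textbook description of nucleus sampling, not the internals of any particular service; the logits and token names are made up, and `next_token_pool` is a hypothetical helper. It requires a temperature above 0.

```python
import math


def next_token_pool(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Return renormalised probabilities of the tokens that survive top_p filtering."""
    # Temperature rescales logits before softmax: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}

    # Nucleus (top_p) filtering: keep the most likely tokens until their
    # cumulative probability reaches top_p, then drop the rest.
    pool, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    kept = sum(pool.values())
    return {tok: p / kept for tok, p in pool.items()}


logits = {"simple": 2.0, "complex": 1.0, "maybe": 0.0, "banana": -2.0}  # made-up values
print(next_token_pool(logits, temperature=0.1, top_p=0.9))   # classification-style settings
print(next_token_pool(logits, temperature=1.2, top_p=0.95))  # creative-style settings
```

With the classification-style settings only the single most likely token survives, while the creative-style settings leave three candidates in play and drop only the least likely one.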
Setting temperature and top_p to 0 simultaneously causes severe latency, observed at 8x slower in practice. Use temperature ~0.1 for deterministic tasks, not a hard zero, and leave top_p above 0.8.
TPM Quotas and HTTP 429
When you deploy a model in Foundry, you allocate a tokens-per-minute (TPM) quota. Exceed it and you get an HTTP 429 response — not a 404, not a 503. This is a rate limit, not a server error. Handle it with exponential backoff:
import time
import requests


def call_with_retry(endpoint: str, payload: dict, headers: dict) -> dict:
    """POST to a model endpoint, backing off exponentially on HTTP 429."""
    for attempt in range(3):
        resp = requests.post(endpoint, json=payload, headers=headers, timeout=60)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
            continue
        resp.raise_for_status()  # any other 4xx/5xx is raised immediately
        return resp.json()
    raise RuntimeError("Model call failed after 3 attempts")
Monitor your TPM usage in Foundry metrics. If you're hitting 429s during normal traffic, request a quota increase via the Azure portal — each region has per-model limits that can be raised.
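Before requesting an increase, it helps to have a rough number in hand. The sketch below estimates the TPM a workload needs from its request rate and average token counts. All names and figures here are hypothetical, and the service may count tokens differently from this simple sum, so treat the result as a starting point and compare it against what Foundry metrics actually report.

```python
import math


def required_tpm(requests_per_minute: int, prompt_tokens: int,
                 completion_tokens: int, headroom_pct: int = 20) -> int:
    """Estimate the tokens-per-minute quota needed, with a safety margin on top."""
    per_request = prompt_tokens + completion_tokens
    baseline = requests_per_minute * per_request
    # Add headroom so ordinary traffic spikes do not immediately trigger 429s.
    return math.ceil(baseline * (100 + headroom_pct) / 100)


# Hypothetical workload: 120 requests/min, ~400 prompt and ~150 completion tokens each.
print(required_tpm(120, prompt_tokens=400, completion_tokens=150))
```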