LLMs vs SLMs: Choosing the Right Azure AI Model for Your Agent

One of the most expensive mistakes you can make early in Azure AI development is treating all models the same. Route everything through GPT-5 and you can end up spending as much as 50x what you need to. Use a small model for everything and complex reasoning falls apart. The right architecture almost always involves both, each used where it fits.

What "Large" and "Small" Actually Mean

The terms LLM and SLM come down to parameter count: the number of internal weights a model learned during training. More parameters generally mean more reasoning capacity, but also more compute per inference, which translates directly into latency and cost.

Model	Type	Parameters (approx)	Latency	Best suited for
GPT-5 / GPT-4o	LLM	Not publicly disclosed	500–2,000ms	Complex reasoning, orchestration, planning
Phi-4	SLM boundary	~14 billion	200–400ms	Balanced capability and speed
Phi-3 Small	SLM	~7 billion	100–200ms	Summarisation, extraction, classification
Phi-4 Mini	SLM	~3.8 billion	50–150ms	Ultra-fast routing, simple yes/no decisions
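
To make the link between parameter count and compute concrete, here is a back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter) and the common rule of thumb of roughly 2 floating-point operations per parameter per generated token. Both are simplifications, and the function names are illustrative rather than part of any Azure SDK.

```python
def weight_memory_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone (assumes 2 bytes per parameter)."""
    return params * bytes_per_param / 1e9


def forward_flops_per_token(params: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per parameter per generated token."""
    return 2 * params


# Parameter counts taken from the small-model rows of the table above.
for name, params in [("3.8B model", 3.8e9), ("7B model", 7e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights, "
          f"~{forward_flops_per_token(params) / 1e9:.1f} GFLOPs per token")
```

Under these assumptions both numbers scale linearly with parameter count, which is why per-call latency and price tend to fall as the model shrinks.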

A nuance worth knowing: Parameter count isn't everything. The Phi series is trained on high-quality synthetic data and carefully curated text — not just a raw web crawl. That data quality means Phi-4 punches above its weight for structured tasks like classification and extraction. Don't dismiss it because it's small.

The Latency Problem Nobody Expects

Here's the calculation that surprises most developers. Say you build a multi-step agent that makes 10 sequential LLM calls: routing decisions, tool interpretations, response formatting. At the upper end of GPT-4o's latency range, 2 seconds per call, that's 20 seconds of user-perceived wait time. Most users assume something is broken well before the 15-second mark.

Replace those 10 calls with Phi-4 Mini calls at 150ms each and you're at 1.5 seconds total. Same task, 13x faster, at a fraction of the cost. This is why sub-agent calls in multi-agent architectures should default to SLMs unless the task genuinely requires GPT-level reasoning.

Think of it as a relay race: You don't put your star runner on every leg of a 10-leg relay. You save them for the critical anchor leg. Use SLMs for the routine legs where speed and cost matter; use GPT-5 for the one leg where you need to win on quality.
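
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the calls run strictly one after another and uses the per-call figures quoted in this section; the "mixed" case (nine SLM calls plus one LLM call) is an added illustration of the relay idea, not a measured result.

```python
def sequential_latency_s(calls: int, per_call_ms: float) -> float:
    """Total wait in seconds when calls run one after another."""
    return calls * per_call_ms / 1000


llm_total = sequential_latency_s(10, 2000)   # every call on the LLM
slm_total = sequential_latency_s(10, 150)    # every call on Phi-4 Mini
# Relay pattern: nine fast legs on the SLM, one anchor leg on the LLM.
mixed_total = sequential_latency_s(9, 150) + sequential_latency_s(1, 2000)

print(f"All LLM: {llm_total:.2f}s")
print(f"All SLM: {slm_total:.2f}s ({llm_total / slm_total:.1f}x faster)")
print(f"Mixed:   {mixed_total:.2f}s")
```

Even with one LLM call kept for the hard step, the mixed pipeline finishes in about 3.35 seconds instead of 20.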

The Hybrid Routing Pattern

The architecture that works in production: use an SLM to classify every incoming request, then route to the appropriate model. The SLM handles the decision cheaply and fast; the LLM only sees the work it's actually equipped for.

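import os, requests

# os.environ[...] raises KeyError at startup if a variable is missing,
# instead of failing later with a confusing "invalid URL" error.
SLM  = os.environ["PHI4_MINI_ENDPOINT"]
LLM  = os.environ["GPT5_ENDPOINT"]
HDRS = {"api-key": os.environ["FOUNDRY_API_KEY"], "Content-Type": "application/json"}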

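def classify(message: str) -> str:
    """SLM decides: 'simple' or 'complex'. Runs on every single request."""
    resp = requests.post(SLM, headers=HDRS, timeout=30, json={
        "messages": [
            {"role": "system", "content":
             "Reply with exactly one word: 'simple' or 'complex'. "
             "Simple = greetings, FAQs, status checks. "
             "Complex = multi-step tasks, planning, calculations."},
            {"role": "user", "content": message}
        ],
        "max_tokens": 5,
        "temperature": 0.1
    })
    resp.raise_for_status()  # Surface HTTP errors (including 429) instead of a KeyError below
    label = resp.json()["choices"][0]["message"]["content"].strip().lower()
    # The model may add punctuation or extra words. Anything that is not clearly
    # 'simple' is treated as 'complex', so hard requests never land on the SLM.
    return "simple" if label.startswith("simple") else "complex"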

def respond(message: str) -> str:
    if classify(message) == "simple":
        endpoint, system = SLM, "You are a helpful assistant. Answer concisely."
    else:
        endpoint, system = LLM, "Think step by step. Use tools as needed."

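    resp = requests.post(endpoint, headers=HDRS, timeout=60, json={
        "messages": [
            {"role": "system", "content": system},
            {"role": "user",   "content": message}
        ]
    })
    resp.raise_for_status()  # Fail loudly on HTTP errors rather than on a missing "choices" key
    return resp.json()["choices"][0]["message"]["content"]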

In well-tuned systems, 85–90% of traffic can route to the SLM, so only genuinely complex requests reach the expensive model. At production scale the cost difference is not marginal: it can approach an order of magnitude.
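
A rough cost model shows where the savings come from. The prices below are hypothetical placeholders chosen only to reflect the 50x gap mentioned at the start of this article, and the token counts are assumptions; substitute your own deployment's pricing before drawing conclusions.

```python
LLM_PRICE = 10.00   # hypothetical: dollars per 1M tokens on the large model
SLM_PRICE = 0.20    # hypothetical: dollars per 1M tokens on the small model (50x cheaper)


def cost_per_1k_requests(slm_share: float,
                         tokens_per_request: int = 1000,
                         classifier_tokens: int = 100) -> float:
    """Dollar cost of 1,000 requests under hybrid routing.

    Every request pays for one SLM classification call, then for an answer
    from the SLM (slm_share of traffic) or the LLM (the remainder).
    """
    classify = classifier_tokens * SLM_PRICE / 1_000_000
    blended_price = slm_share * SLM_PRICE + (1 - slm_share) * LLM_PRICE
    answer = tokens_per_request * blended_price / 1_000_000
    return 1000 * (classify + answer)


# Baseline: every request answered by the LLM, with no classifier call.
all_llm = 1000 * 1000 * LLM_PRICE / 1_000_000

for share in (0.85, 0.90):
    hybrid = cost_per_1k_requests(share)
    print(f"{share:.0%} to SLM: ${hybrid:.2f} vs ${all_llm:.2f} "
          f"({all_llm / hybrid:.1f}x cheaper)")
```

With these assumed numbers the hybrid setup comes out roughly 6x to 8x cheaper, and the ratio keeps climbing as a larger share of traffic stays on the SLM. The classifier call adds very little because it is short and runs on the cheap model.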

Parameters That Matter: Temperature and top_p

Temperature controls how deterministic vs creative the output is. At 0, the model picks the highest-probability next token every time — you get consistent, repeatable results. At 1+, it samples more broadly — outputs are varied, sometimes surprisingly good, sometimes inconsistent.

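top_p (nucleus sampling) controls the size of the token pool the model samples from. The model ranks candidate tokens by probability and keeps the smallest set whose combined probability reaches top_p. At 0.1, only the few most likely tokens that together account for 10% of the probability mass are considered, which is often a single token. At 1.0, the full distribution is in play.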
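
The two parameters are easier to reason about with a toy example. The sketch below applies temperature to a made-up set of logits and then selects the top_p pool; the four-token vocabulary and logit values are invented for illustration, and real models do this over tens of thousands of tokens inside the inference engine.

```python
import math


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits / temperature (temperature must be > 0).

    Lower temperature sharpens the distribution; higher temperature flattens it.
    """
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def nucleus(probs: dict[str, float], top_p: float) -> list[str]:
    """Smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


# Invented logits for a four-token vocabulary.
logits = {"simple": 2.0, "complex": 1.0, "maybe": 0.0, "unsure": -1.0}

for temperature in (0.1, 1.0, 1.2):
    probs = apply_temperature(logits, temperature)
    pool = nucleus(probs, top_p=0.9)
    print(f"temperature={temperature}: P(simple)={probs['simple']:.3f}, pool={pool}")
```

At temperature 0.1 nearly all of the probability sits on one token, so the pool collapses to a single choice. At higher temperatures the distribution flattens and more tokens enter the pool.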

# Classification: you want the same answer every time
{
    "temperature": 0.1,  # Near-deterministic
    "top_p": 0.9,
    "max_tokens": 10     # Strict output cap for routing decisions
}

# Creative writing: you want interesting variation
{
    "temperature": 1.2,  # More creative
    "top_p": 0.95,
    "max_tokens": 500
}

Common pitfall: Setting both temperature and top_p to 0 at the same time has been observed to cause severe latency, around 8x slower in practice. For deterministic tasks, use a temperature of about 0.1 rather than a hard zero, and leave top_p above 0.8.

TPM Quotas and HTTP 429

When you deploy a model in Foundry, you allocate a tokens-per-minute (TPM) quota. Exceed it and you get an HTTP 429 response — not a 404, not a 503. This is a rate limit, not a server error. Handle it with exponential backoff:

import time, requests

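def call_with_retry(endpoint: str, payload: dict, headers: dict, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        resp = requests.post(endpoint, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()  # Other 4xx/5xx errors fail immediately, no retry
            return resp.json()
        if attempt < max_attempts - 1:  # Don't sleep after the final attempt
            # Prefer the server's Retry-After hint (seconds) if present;
            # otherwise back off exponentially: 1s, 2s, ...
            retry_after = resp.headers.get("Retry-After", "")
            time.sleep(float(retry_after) if retry_after.isdigit() else 2 ** attempt)
    raise RuntimeError(f"Rate limited: model call failed after {max_attempts} attempts")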

Monitor your TPM usage in Foundry metrics. If you're hitting 429s during normal traffic, request a quota increase via the Azure portal — each region has per-model limits that can be raised.
