Azure AI Agent Memory: Redis Sessions & Cosmos DB Long-Term Storage

Every LLM call starts with a blank slate. The model has no recollection of anything you've said before — not this conversation, not any conversation. What feels like memory in a chat interface is actually engineering: your code maintains state and feeds it back in with each request. Here's how to build that layer properly.

Two Timescales, Two Stores

When people talk about agent memory they're usually conflating two very different things. Short-term memory is the conversation happening right now — what was said three messages ago, what the user asked for, what the agent is waiting on. Long-term memory is what you want to remember about this user across all their sessions — their preferences, their history, their account details.

These two needs have fundamentally different requirements, and that's why they live in different stores.

Memory type	Scope	Store	Expiry	Typical contents
Short-term (session)	Current conversation only	Redis (in-memory cache)	~1 hour of inactivity	All messages this session, pending task state, awaited confirmations
Long-term (persistent)	Across all sessions	Cosmos DB (NoSQL)	90 days (TTL)	User preferences, notification settings, language, support history

The hotel concierge analogy: The notepad the concierge carries during your stay is session memory — your dinner reservation tonight, your luggage in 412, your early checkout request. The guest profile in their system is long-term memory — your pillow preference, your loyalty tier, that you always want a high floor. The notepad goes in the bin when you leave. The profile is there when you return.

Short-Term Memory: Redis

Redis is an in-memory data store: it serves reads and writes from RAM rather than from disk, which typically makes them far faster than a query against a disk-backed database. For session memory, which gets read and written on potentially every LLM call, that speed matters directly to user experience.

Two things catch developers out when first using Redis for this purpose:

import json
import os

import redis

r = redis.Redis(
    host=os.getenv("REDIS_HOST"),
    port=6380,
    password=os.getenv("REDIS_KEY"),
    ssl=True
)

TTL = 3600  # 1 hour of inactivity before session expires

def save_session(user_id: str, session_id: str, messages: list) -> None:
    key = f"session:{user_id}:{session_id}"
    # Gotcha #1: Redis stores bytes, not Python objects.
    # You MUST serialize to a JSON string first. Passing a list directly
    # makes redis-py raise DataError ("Invalid input of type: 'list'. ...").
    r.setex(key, TTL, json.dumps({"messages": messages}))
    # Gotcha #2: Use setex, not a bare set.
    # set() without an expiry argument never expires: sessions accumulate and fill up RAM.
    # setex() sets value AND TTL in one atomic operation.

def load_session(user_id: str, session_id: str) -> list:
    key = f"session:{user_id}:{session_id}"
    raw = r.get(key)
    if raw is None:
        return []  # New session — empty history
    # Deserialize back from JSON string to Python object after reading
    return json.loads(raw).get("messages", [])

The two Redis gotchas to memorise: (1) Always json.dumps() before writing and json.loads() after reading — Redis doesn't understand Python objects. (2) Always use setex, not set — setex includes a TTL. These trip up exam questions and real implementations alike.
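To see the "hour of inactivity" behaviour without a running Redis server, here is a minimal sketch using an in-memory stand-in. `FakeRedis` and its injectable `clock` are hypothetical helpers written for this example, not part of redis-py; they mimic only the `setex` and `get` calls used above.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class FakeRedis:
    """Hypothetical in-memory stand-in that mimics only setex() and get()."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, bytes]] = {}  # key -> (expires_at, value)

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        # Writing a key replaces both its value and its expiry time.
        self._data[key] = (self._clock() + ttl_seconds, value.encode("utf-8"))

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # an expired key behaves as if it never existed
            return None
        return value


now = [0.0]  # mutable fake clock, so the example never has to sleep
fake = FakeRedis(clock=lambda: now[0])
key = "session:user-1:abc"

fake.setex(key, 3600, json.dumps({"messages": [{"role": "user", "content": "hi"}]}))

now[0] = 3000.0  # 50 minutes later: still inside the first TTL
history = json.loads(fake.get(key))["messages"]
history.append({"role": "assistant", "content": "hello"})
fake.setex(key, 3600, json.dumps({"messages": history}))  # saving resets the TTL

now[0] = 6000.0  # 100 minutes after the first write, 50 after the last one
still_alive = fake.get(key) is not None

now[0] = 6600.0  # 60 minutes after the last write
expired = fake.get(key) is None
```

Because every save rewrites the key with a fresh TTL, the session survives as long as the user keeps talking and disappears one hour after the last write.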

Long-Term Memory: Cosmos DB

Cosmos DB is a globally distributed NoSQL document database. It natively stores and returns JSON, which is convenient since agents work in JSON throughout. It persists across restarts, scales globally, and supports automatic document expiry via TTL.

import os

from azure.cosmos import CosmosClient, exceptions
from azure.identity import DefaultAzureCredential

# Production: use managed identity, not connection strings
cosmos = CosmosClient(
    url=os.getenv("COSMOS_ENDPOINT"),
    credential=DefaultAzureCredential()
    # Assign: "Cosmos DB Built-in Data Contributor" role to your managed identity
)
container = cosmos.get_database_client("agent-db").get_container_client("user-memory")
# Container partition key path must be "/user_id"

def save_preferences(user_id: str, prefs: dict) -> None:
    container.upsert_item({
        "id":      user_id,   # Unique document ID
        "user_id": user_id,   # Partition key: MUST match container's key path
        "preferences": prefs,
        # 90 days in seconds, counted from the last write. Per-item ttl only
        # takes effect when TTL is enabled on the container (default TTL set).
        "ttl":     7776000
    })
    # upsert_item: creates the document if it doesn't exist, replaces it if it does.
    # One function handles both insert and update. No existence check needed.

def load_preferences(user_id: str) -> dict:
    try:
        doc = container.read_item(item=user_id, partition_key=user_id)
        return doc.get("preferences", {})
    except exceptions.CosmosResourceNotFoundError:
        # First-time user: no preferences yet. Other errors (auth, throttling)
        # propagate instead of being mistaken for "no preferences".
        return {}

Why partition key = user_id matters: Cosmos DB physically distributes documents across storage partitions. When you read a document by its id and partition key, Cosmos DB routes the request straight to the one partition that holds it. A query that omits the partition key has to fan out across every physical partition, which costs more request units and adds latency as the container grows. This is a common exam question about Cosmos DB performance.
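The routing argument can be illustrated with a toy model. This is not how Cosmos DB hashes or stores data internally; `ToyContainer`, the partition count, and the hash function are all assumptions made only to show why a lookup by partition key touches one partition while a filter on any other field touches all of them.

```python
import hashlib
from typing import Dict, List, Optional


class ToyContainer:
    """Toy model of a partitioned store (illustrative only, not Cosmos DB internals)."""

    def __init__(self, partition_count: int = 4) -> None:
        self.partitions: List[Dict[str, dict]] = [{} for _ in range(partition_count)]
        self.partitions_touched = 0  # counts partitions visited by reads

    def _partition_for(self, partition_key: str) -> int:
        digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def upsert(self, doc: dict) -> None:
        self.partitions[self._partition_for(doc["user_id"])][doc["id"]] = doc

    def point_read(self, item_id: str, partition_key: str) -> Optional[dict]:
        # The partition key says exactly where to look.
        self.partitions_touched += 1
        return self.partitions[self._partition_for(partition_key)].get(item_id)

    def query_by_field(self, field: str, value: object) -> List[dict]:
        # No partition key available, so every partition has to be checked.
        matches: List[dict] = []
        for partition in self.partitions:
            self.partitions_touched += 1
            matches.extend(d for d in partition.values() if d.get(field) == value)
        return matches


toy = ToyContainer(partition_count=4)
for n in range(20):
    language = "en" if n % 2 else "fr"
    toy.upsert({"id": f"user-{n}", "user_id": f"user-{n}", "language": language})

toy.partitions_touched = 0
doc = toy.point_read("user-7", partition_key="user-7")
point_read_cost = toy.partitions_touched   # one partition visited

toy.partitions_touched = 0
french_users = toy.query_by_field("language", "fr")
fan_out_cost = toy.partitions_touched      # every partition visited
```

In the toy model the gap is 1 versus 4 partitions; the real cost difference depends on how many physical partitions the container has.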

The Hybrid Pattern: Both Working Together

async def handle(user_id: str, session_id: str, user_msg: str) -> str:
    # Session start: load long-term preferences from Cosmos DB
    prefs    = load_preferences(user_id)
    prefs_ctx = f"User preferences: {json.dumps(prefs)}" if prefs else ""

    # Load current conversation history from Redis
    messages = load_session(user_id, session_id)
    messages.append({"role": "user", "content": user_msg})

    # Call LLM with preferences in system message, full history as context
    # (llm is assumed to be an async OpenAI-compatible client, e.g. AsyncAzureOpenAI)
    response = await llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a helpful assistant. {prefs_ctx}"},
            *messages
        ]
    )
    reply = response.choices[0].message.content

    # Save updated history — resets the TTL clock
    messages.append({"role": "assistant", "content": reply})
    save_session(user_id, session_id, messages)

    # If we learned something new about the user, persist to Cosmos DB
    # (this logic lives in your application, not shown here)

    return reply
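One detail of that final step is worth sketching: `upsert_item` replaces the whole document, so newly learned facts should be merged into the existing preferences before saving, or older preferences are lost. `merge_preferences` below is a hypothetical helper, and treating `None` as "nothing new learned" is an assumption made for illustration.

```python
from typing import Any, Dict


def merge_preferences(existing: Dict[str, Any], learned: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with learned values layered over existing ones.

    Hypothetical helper: keys whose learned value is None are treated as
    "nothing new" and leave the stored value untouched.
    """
    merged = dict(existing)  # copy so the caller's dict is not mutated
    for key, value in learned.items():
        if value is not None:
            merged[key] = value
    return merged


stored = {"language": "en", "notifications": "email"}
learned = {"language": "fr", "timezone": "Europe/Paris", "notifications": None}
updated = merge_preferences(stored, learned)
# updated can now be passed to save_preferences(user_id, updated)
```

This is a shallow merge; nested preference objects would need their own merge rule.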

Managing Token Limits: Sliding Window

Conversations grow over time, and every message you add to history increases the token count of the next LLM call. At some point you'll hit the model's context limit. The sliding window approach keeps the most recent messages and drops the oldest — but there's one thing you must never drop:

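def trim_history(messages: list, max_messages: int = 20) -> list:
    if max_messages < 2:
        # With max_messages == 1 the slice below would be [-0:], i.e. the whole list.
        raise ValueError("max_messages must be at least 2 (system message + 1 turn)")
    if len(messages) <= max_messages:
        return messages
    # ALWAYS preserve messages[0]: it's your system instruction.
    # Drop it and the agent loses its persona, safety constraints,
    # tool definitions, and every boundary you've set.
    # (Assumes the list passed in starts with the system message.)
    return [messages[0]] + messages[-(max_messages - 1):]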

If you need to preserve more semantic content than a window allows, use summarisation: call the LLM to compress the old conversation into a short summary, then replace the dropped messages with that summary. It costs one extra LLM call but keeps intent and key facts intact.
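A minimal sketch of that summarisation step, assuming `messages[0]` is the system message as in `trim_history`. `compress_history` and its `keep_recent` parameter are hypothetical names, and `summarize` is a placeholder for whatever LLM call you use; the stand-in here only counts messages so the example runs offline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compress_history(
    messages: List[Message],
    summarize: Callable[[List[Message]], str],
    keep_recent: int = 6,
) -> List[Message]:
    """Replace older turns with one summary message, keeping the system message.

    Assumes messages[0] is the system message. `summarize` is supplied by the
    caller (typically one extra LLM call) and returns a short string.
    """
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to compress
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return [system, summary] + recent


def fake_summarize(old: List[Message]) -> str:
    # Offline stand-in for an LLM call: only reports how many turns were dropped.
    return f"{len(old)} earlier messages omitted"


history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

compressed = compress_history(history, fake_summarize, keep_recent=4)
```

Here the eleven-message history shrinks to six: the system message, one summary, and the four most recent turns.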

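Redis value size: a single string value can technically hold up to 512 MB, but large values slow down every read and write. Keep session payloads small: store message text, not large binary tool outputs.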
Cosmos DB document limit: ~2 MB. More than enough for preference objects.
Always set TTL on Cosmos DB documents — storage that never gets cleaned up will grow indefinitely.
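Since these limits apply to the serialized document, it can help to measure a payload before writing it. A small sketch follows; `MAX_SESSION_BYTES` is an arbitrary budget chosen for this example, not a limit imposed by Redis or Cosmos DB, and both helper names are hypothetical.

```python
import json

MAX_SESSION_BYTES = 256 * 1024  # illustrative budget, chosen for this example only


def payload_size_bytes(payload: dict) -> int:
    """Size of the payload as it would be written: UTF-8 encoded JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_budget(payload: dict, limit: int = MAX_SESSION_BYTES) -> bool:
    return payload_size_bytes(payload) <= limit


small = {"messages": [{"role": "user", "content": "hello"}]}
large = {"messages": [{"role": "tool", "content": "x" * (300 * 1024)}]}

small_ok = fits_budget(small)   # a short text message is well under the budget
large_ok = fits_budget(large)   # a 300 KB tool output exceeds it
```

A check like this is a natural place to trigger trimming or summarisation before the write, rather than discovering an oversized document afterwards.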
