Every LLM call starts with a blank slate. The model has no recollection of anything you've said before — not this conversation, not any conversation. What feels like memory in a chat interface is actually engineering: your code maintains state and feeds it back in with each request. Here's how to build that layer properly.
Two Timescales, Two Stores
When people talk about agent memory they're usually conflating two very different things. Short-term memory is the conversation happening right now — what was said three messages ago, what the user asked for, what the agent is waiting on. Long-term memory is what you want to remember about this user across all their sessions — their preferences, their history, their account details.
The two differ in access frequency, lifetime, and scope, which is why they live in different stores.
| Memory type | Scope | Store | Expiry | Typical contents |
|---|---|---|---|---|
| Short-term (session) | Current conversation only | Redis (in-memory cache) | ~1 hour of inactivity | All messages this session, pending task state, awaited confirmations |
| Long-term (persistent) | Across all sessions | Cosmos DB (NoSQL) | 90 days (TTL) | User preferences, notification settings, language, support history |
Short-Term Memory: Redis
Redis is an in-memory data store: it serves reads and writes from RAM rather than from disk, so a typical operation completes far faster than a query against a disk-backed database. Session memory is read and written on potentially every LLM call, so that speed directly affects user experience.
Two things catch developers out when first using Redis for this purpose:
```python
import json
import os

import redis

r = redis.Redis(
    host=os.getenv("REDIS_HOST"),
    port=6380,
    password=os.getenv("REDIS_KEY"),
    ssl=True,
)
TTL = 3600  # 1 hour of inactivity before session expires

def save_session(user_id: str, session_id: str, messages: list) -> None:
    key = f"session:{user_id}:{session_id}"
    # Gotcha #1: redis-py accepts only bytes, str, int, or float values.
    # You MUST serialize to a JSON string first. Passing a list directly
    # raises redis.exceptions.DataError.
    # Gotcha #2: Use setex, not a plain set.
    # set() without an expiry argument never expires, so sessions
    # accumulate forever and fill up RAM.
    # setex() sets value AND TTL in one atomic operation.
    r.setex(key, TTL, json.dumps({"messages": messages}))

def load_session(user_id: str, session_id: str) -> list:
    key = f"session:{user_id}:{session_id}"
    raw = r.get(key)
    if raw is None:
        return []  # New or expired session: empty history
    # Deserialize back from JSON to a Python object after reading
    return json.loads(raw).get("messages", [])
```
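To see the JSON round trip and the expiry behaviour without a running Redis server, the sketch below uses a small in-memory stand-in. `FakeStore` is a hypothetical class that only mimics the `setex`/`get` pair used above. It is not part of redis-py, and the injectable clock exists purely so expiry can be shown without waiting an hour.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class FakeStore:
    """Hypothetical in-memory stand-in for the two Redis calls used above."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[bytes, float]] = {}

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        # Like Redis, keep only bytes and remember when the key should expire.
        self._data[key] = (value.encode("utf-8"), self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired keys behave as if they never existed
            return None
        return value


def demo_round_trip() -> Tuple[list, Optional[bytes]]:
    now = [0.0]  # mutable fake clock, in seconds
    store = FakeStore(clock=lambda: now[0])
    messages = [{"role": "user", "content": "hi"}]
    store.setex("session:u1:s1", 3600, json.dumps({"messages": messages}))
    loaded = json.loads(store.get("session:u1:s1"))["messages"]
    now[0] = 3601.0  # advance past the TTL
    expired = store.get("session:u1:s1")
    return loaded, expired
```

`demo_round_trip()` returns the original message list for the first read and `None` for the read after the TTL has passed, which is the same "empty history" case that `load_session` handles.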
To recap: (1) Always call json.dumps() before writing and json.loads() after reading, because Redis doesn't understand Python objects. (2) Always use setex, not a plain set, because setex includes a TTL. These trip up exam questions and real implementations alike.
Long-Term Memory: Cosmos DB
Cosmos DB is a globally distributed NoSQL document database. It natively stores and returns JSON, which is convenient since agents work in JSON throughout. It persists across restarts, scales globally, and supports automatic document expiry via TTL.
```python
import os

from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosResourceNotFoundError
from azure.identity import DefaultAzureCredential

# Production: use managed identity, not connection strings.
# Assign the "Cosmos DB Built-in Data Contributor" role to your managed identity.
cosmos = CosmosClient(
    url=os.getenv("COSMOS_ENDPOINT"),
    credential=DefaultAzureCredential(),
)
container = cosmos.get_database_client("agent-db").get_container_client("user-memory")
# Container partition key path must be "/user_id"

def save_preferences(user_id: str, prefs: dict) -> None:
    container.upsert_item({
        "id": user_id,       # Unique document ID
        "user_id": user_id,  # Partition key: MUST match container's key path
        "preferences": prefs,
        # 90 days in seconds. The document is auto-deleted 90 days after its
        # last write. Per-document ttl only takes effect if TTL is enabled
        # on the container.
        "ttl": 7776000,
    })
    # upsert_item: creates the document if it doesn't exist, replaces it if it does.
    # One function handles both insert and update. No existence check needed.

def load_preferences(user_id: str) -> dict:
    try:
        doc = container.read_item(item=user_id, partition_key=user_id)
        return doc.get("preferences", {})
    except CosmosResourceNotFoundError:
        # First-time user: no preferences yet. Other errors (auth, network)
        # propagate instead of being mistaken for a new user.
        return {}
```
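The TTL arithmetic and the document shape can be checked in isolation. The helper below is a sketch: `build_preference_doc` and `ttl_days` are names introduced here for illustration and are not part of the Azure SDK. It mirrors the fields written by `save_preferences` and derives `ttl` from a number of days.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def build_preference_doc(user_id: str, prefs: dict, ttl_days: int = 90) -> dict:
    """Build the document body that would be passed to upsert_item()."""
    if ttl_days < 1:
        raise ValueError("ttl_days must be at least 1")
    return {
        "id": user_id,        # document ID
        "user_id": user_id,   # partition key value
        "preferences": prefs,
        "ttl": ttl_days * SECONDS_PER_DAY,  # Cosmos DB expects seconds
    }


doc = build_preference_doc("user-42", {"language": "en"})
```

With the default of 90 days, `doc["ttl"]` is 7776000, matching the literal used above. Cosmos DB honours the per-document value only when TTL is enabled on the container, so that setting is worth verifying before relying on automatic cleanup.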
The Hybrid Pattern: Both Working Together
```python
# Assumes `llm` is an async OpenAI-compatible client created elsewhere
# (for example openai.AsyncOpenAI or openai.AsyncAzureOpenAI).
async def handle(user_id: str, session_id: str, user_msg: str) -> str:
    # Session start: load long-term preferences from Cosmos DB
    prefs = load_preferences(user_id)
    prefs_ctx = f"User preferences: {json.dumps(prefs)}" if prefs else ""

    # Load current conversation history from Redis
    messages = load_session(user_id, session_id)
    messages.append({"role": "user", "content": user_msg})

    # Call LLM with preferences in system message, full history as context.
    # The system message is built per call and is not stored in Redis.
    response = await llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a helpful assistant. {prefs_ctx}"},
            *messages,
        ],
    )
    reply = response.choices[0].message.content

    # Save updated history, which also resets the TTL clock
    messages.append({"role": "assistant", "content": reply})
    save_session(user_id, session_id, messages)

    # If we learned something new about the user, persist to Cosmos DB
    # (this logic lives in your application, not shown here)
    return reply
```
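The "learned something new" step is left to the application. One possible shape, assuming preferences are a flat dict, is to merge the new values and write only when something actually changed. `merge_preferences` is a hypothetical helper. How new preferences are detected (for example through a tool call or an extraction prompt) is outside this sketch.

```python
from typing import Tuple


def merge_preferences(current: dict, learned: dict) -> Tuple[dict, bool]:
    """Return the merged preferences and whether anything changed."""
    merged = {**current, **learned}  # learned values win on conflict
    return merged, merged != current


current = {"language": "en", "notifications": "email"}
merged, changed = merge_preferences(current, {"language": "fr"})
unchanged, changed_again = merge_preferences(merged, {"language": "fr"})
```

In `handle`, the result could gate the write: call `save_preferences(user_id, merged)` only when `changed` is true. Skipping redundant writes saves a round trip, though it also means the 90-day TTL is not refreshed on those turns, so whether to skip is a design choice.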
Managing Token Limits: Sliding Window
Conversations grow over time, and every message you add to history increases the token count of the next LLM call. At some point you'll hit the model's context limit. The sliding window approach keeps the most recent messages and drops the oldest — but there's one thing you must never drop:
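```python
def trim_history(messages: list, max_messages: int = 20) -> list:
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    if len(messages) <= max_messages:
        return messages
    # ALWAYS preserve messages[0]: this assumes it is your system instruction.
    # Drop it and the agent loses its persona, safety constraints,
    # tool definitions, and every boundary you've set.
    # If the system message is added at call time instead (as in handle()),
    # trim the stored history on its own.
    # Explicit start index: messages[-0:] would return the whole list.
    start = len(messages) - (max_messages - 1)
    return [messages[0]] + messages[start:]
```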
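A message count is only a proxy for tokens: twenty long messages can still overflow the context. A variant that trims against a token budget is sketched below. The four-characters-per-token estimate is a rough heuristic assumed here for illustration, and a real tokenizer for your model would be more accurate. `trim_to_budget` and `estimate_tokens` are hypothetical names, and the sketch assumes `messages[0]` is the system message.

```python
from typing import Callable, List


def estimate_tokens(message: dict) -> int:
    # Rough heuristic (assumption): about 4 characters per token, minimum 1.
    return max(1, len(message.get("content", "")) // 4)


def trim_to_budget(
    messages: List[dict],
    max_tokens: int,
    count: Callable[[dict], int] = estimate_tokens,
) -> List[dict]:
    """Keep messages[0] plus the newest messages that fit in max_tokens."""
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count(system)
    kept: List[dict] = []
    # Walk from newest to oldest, stopping at the first message that doesn't fit.
    for message in reversed(rest):
        cost = count(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system] + kept


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_to_budget(history, max_tokens=30)
```

Here the 400-character message is dropped while the system message and the two most recent messages survive. Stopping at the first message that does not fit keeps the retained history contiguous, at the cost of sometimes leaving budget unused.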
If you need to preserve more semantic content than a window allows, use summarisation: call the LLM to compress the old conversation into a short summary, then replace the dropped messages with that summary. It costs one extra LLM call but keeps intent and key facts intact.
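A sketch of that idea follows, assuming `messages[0]` is the system message. The `summarize` argument is a hypothetical callable that would wrap your LLM call and return a string. A stub is used here so the example runs without a model. Storing the summary as a second system message is one choice among several.

```python
from typing import Callable, List


def compress_history(
    messages: List[dict],
    summarize: Callable[[List[dict]], str],
    keep_recent: int = 10,
) -> List[dict]:
    """Replace older messages with one summary message, keeping the newest ones."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    # Nothing to compress: system message plus at most keep_recent messages.
    if len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]
    split = len(messages) - keep_recent
    old, recent = messages[1:split], messages[split:]
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return [system, summary] + recent


def stub_summarize(old: List[dict]) -> str:
    # Stand-in for a real LLM call (assumption for this example).
    return f"{len(old)} earlier messages omitted"


history = [{"role": "system", "content": "You are a helpful assistant."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(6)
]
compressed = compress_history(history, stub_summarize, keep_recent=2)
```

With six user messages and `keep_recent=2`, the four oldest are replaced by a single summary message and the last two are kept verbatim.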
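- Redis value size: a string value can technically hold up to 512 MB, but large values slow down every read and write. Keep each session key small (well under 1 MB) by storing message text, not large binary tool outputs.
- Cosmos DB document limit: 2 MB per item. More than enough for preference objects.
- Always set TTL on Cosmos DB documents, and make sure TTL is enabled on the container so the per-document value takes effect. Storage that never gets cleaned up will grow indefinitely.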
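One way to enforce the "message text only" advice is to measure the serialized payload before writing it. The 256 KB budget below is an arbitrary illustrative value, not a Redis limit, and `session_payload` is a hypothetical helper.

```python
import json

MAX_SESSION_BYTES = 256 * 1024  # illustrative budget (assumption), tune for your workload


def session_payload(messages: list, limit: int = MAX_SESSION_BYTES) -> str:
    """Serialize a session and refuse payloads over the size budget."""
    payload = json.dumps({"messages": messages})
    size = len(payload.encode("utf-8"))  # measure bytes, not characters
    if size > limit:
        raise ValueError(f"session payload is {size} bytes, limit is {limit}")
    return payload


small = session_payload([{"role": "user", "content": "hi"}])

try:
    session_payload([{"role": "tool", "content": "x" * 100}], limit=50)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True
```

A rejected payload is a signal to trim or summarise the history, or to store the large tool output elsewhere and keep only a reference in the session.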