LangGraph Chapter 6 — Memory: In-Context, Session & Long-Term
Senior Architect Interview Series — LangGraph & Agentic AI
6.0 What This Chapter Covers
Memory is what separates a truly useful agent from a stateless answering machine. This chapter covers:
- The four levels of agent memory with specific examples
- How your project implements each layer
- In-context memory: AgentState.messages and the context window
- Session memory: PostgreSQL ChatHistory via memory.py
- Long-term memory: cross-session storage and retrieval
- Semantic memory: RAG as a form of external agent memory
- LangGraph's built-in checkpointing system
6.1 The Four Levels of Agent Memory
┌─────────────────────────────────────────────────────────────────┐
│ MEMORY HIERARCHY │
│ │
│ L1 — IN-CONTEXT AgentState.messages (current execution) │
│ ├─ Lifespan: One graph invocation │
│ ├─ Capacity: LLM context window (~128K tokens) │
│ └─ Location: RAM, in the AgentState dict │
│ │
│ L2 — SESSION ChatHistory table (PostgreSQL) │
│ ├─ Lifespan: One chat session (hours/days) │
│ ├─ Capacity: Unlimited (last N messages loaded) │
│ └─ Location: Your relational database │
│ │
│ L3 — LONG-TERM Persistent user preferences/facts │
│ ├─ Lifespan: Cross-session, potentially forever │
│ ├─ Capacity: Unlimited │
│ └─ Location: Database with semantic search │
│ │
│ L4 — SEMANTIC ChromaDB vector store (RAG) │
│ ├─ Lifespan: Persistent, updated via ingestion │
│ ├─ Capacity: Scales to millions of documents │
│ └─ Location: chroma_db/ (your project) │
└─────────────────────────────────────────────────────────────────┘
6.2 L1 — In-Context Memory: AgentState.messages
In-context memory is the messages currently loaded in AgentState. This is what the LLM "sees" in its context window during one graph invocation.
What Goes In Here
# From run_agent() in agent/agent.py
history = load_history(session_id, db) # L2 → L1: load from DB
history.append(HumanMessage(content=question)) # add this turn's question
initial_state = {"messages": history}
final_state = agent.invoke(initial_state) # L1 exists for this call
# After return: L1 is discarded
During the invocation:
- The full history from the database is loaded into AgentState.messages
- Tool results get appended (ToolMessages)
- LLM responses get appended (AIMessages)
- At the end, the last message is the answer
After return: The AgentState object ceases to exist — so does L1 memory. The answer is saved back to the database (L2).
Context Window Limits
Modern LLMs support large contexts (~128K tokens for gpt-4o-mini) but:
- More tokens = more cost (charged per token)
- More tokens = slower inference (attention is quadratic in tokens)
- Quality can degrade with very long contexts ("lost in the middle" problem)
Production strategy: Load only the last N turns from the database:
def load_history(session_id: str, db: Session, max_turns: int = 10) -> list[BaseMessage]:
    rows = (
        db.query(ChatHistory)
        .filter(ChatHistory.session_id == session_id)
        .order_by(ChatHistory.id.desc())  # newest first
        .limit(max_turns)  # each row holds one full turn (human + ai)
        .all()
    )
    rows.reverse()  # back to chronological order
    messages = []
    for row in rows:
        messages.append(HumanMessage(content=row.user_message))
        messages.append(AIMessage(content=row.assistant_message))
    return messages
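The newest-first query followed by a reverse can be checked in isolation. The sketch below uses the standard-library sqlite3 module and plain (role, text) tuples as stand-ins for the project's SQLAlchemy session and LangChain message classes. The table layout assumes one row per turn, and the name load_last_turns is hypothetical.

```python
import sqlite3

def load_last_turns(conn: sqlite3.Connection, session_id: str, max_turns: int = 10):
    """Return the most recent turns as (role, text) tuples, oldest first."""
    rows = conn.execute(
        "SELECT user_message, assistant_message FROM chat_history "
        "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, max_turns),
    ).fetchall()
    rows.reverse()  # the query returns newest first, so flip to chronological order
    messages = []
    for user_message, assistant_message in rows:
        messages.append(("human", user_message))
        messages.append(("ai", assistant_message))
    return messages

# Demo: an in-memory database holding five turns for one session
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_history ("
    "id INTEGER PRIMARY KEY, session_id TEXT, "
    "user_message TEXT, assistant_message TEXT)"
)
conn.executemany(
    "INSERT INTO chat_history (session_id, user_message, assistant_message) VALUES (?, ?, ?)",
    [("s1", f"question {i}", f"answer {i}") for i in range(1, 6)],
)
window = load_last_turns(conn, "s1", max_turns=2)
```

With max_turns=2 the window holds turns 4 and 5 only, in the order they were asked.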
6.3 L2 — Session Memory: Your PostgreSQL Implementation
Session memory persists the conversation across calls within the same session. Your project implements this with a ChatHistory SQLAlchemy model stored in PostgreSQL.
The Model (models.py)
# Typical ChatHistory model in the project
# (Base is the project's SQLAlchemy declarative base)
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String, Text

class ChatHistory(Base):
    __tablename__ = "chat_history"
    id = Column(Integer, primary_key=True, index=True)
    session_id = Column(String, index=True)
    user_message = Column(Text)
    assistant_message = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)
Loading History (memory.py)
def load_history(session_id: str, db: Session) -> list[BaseMessage]:
    rows = (
        db.query(ChatHistory)
        .filter(ChatHistory.session_id == session_id)
        .order_by(ChatHistory.id)  # chronological order
        .all()
    )
    messages = []
    for row in rows:
        messages.append(HumanMessage(content=row.user_message))
        messages.append(AIMessage(content=row.assistant_message))
    return messages
This converts database rows into LangChain BaseMessage objects that LangGraph can include in AgentState.messages.
Saving History (memory.py)
def save_history(
    session_id: str,
    question: str,
    answer: str,
    db: Session
) -> None:
    row = ChatHistory(
        session_id=session_id,
        user_message=question,
        assistant_message=answer
    )
    db.add(row)
    db.commit()
After the agent returns, run_agent() calls save_history() to persist the new turn:
# From run_agent():
final_state = agent.invoke({"messages": history})
answer = final_state["messages"][-1].content
save_history(session_id, question, answer, db) # persist to L2
return answer
Session Lifecycle
First Request (new session_id):
  load_history() → [] (empty)
  append HumanMessage("hello")
  agent.invoke → AIMessage("Hi there!")
  save_history("hello", "Hi there!")
  DB: [{session_id, "hello", "Hi there!"}]
Second Request (same session_id):
  load_history() → [HumanMessage("hello"), AIMessage("Hi there!")]
  append HumanMessage("who are you?")
  agent.invoke → sees full history → contextual reply
  save_history("who are you?", "I'm the Agent Factory assistant...")
  DB: [{...}, {session_id, "who are you?", "I'm..."}]
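The two-request lifecycle above can be simulated without a database or an LLM. In this sketch a dictionary stands in for the chat_history table and fake_agent stands in for agent.invoke(). Both are hypothetical stand-ins, not the project's code.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the chat_history table:
# session_id -> list of (question, answer) rows
store = defaultdict(list)

def load_history(session_id):
    """Expand stored rows into a flat, chronological message list (L2 -> L1)."""
    messages = []
    for question, answer in store[session_id]:
        messages.append(("human", question))
        messages.append(("ai", answer))
    return messages

def save_history(session_id, question, answer):
    """Persist one completed turn (L1 -> L2)."""
    store[session_id].append((question, answer))

def fake_agent(messages):
    """Stand-in for agent.invoke(): reports how many prior messages it saw."""
    prior = len(messages) - 1  # everything except the new question
    return f"reply after seeing {prior} prior messages"

def run_agent(question, session_id):
    history = load_history(session_id)
    history.append(("human", question))
    answer = fake_agent(history)
    save_history(session_id, question, answer)
    return answer

first = run_agent("hello", "s1")
second = run_agent("who are you?", "s1")
other = run_agent("hello", "s2")  # a different session starts empty
```

The second call on s1 sees the two messages saved by the first, while s2 starts with no history.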
6.4 L3 — Long-Term Memory: Cross-Session Persistence
Long-term memory stores facts about the user that persist indefinitely across sessions — their preferences, past decisions, important context.
Design Pattern — User Profile Store
class UserMemory(Base):
    __tablename__ = "user_memory"
    id = Column(Integer, primary_key=True)
    user_id = Column(String, index=True)
    key = Column(String)  # e.g., "preferred_format", "team", "last_project"
    value = Column(Text)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

# Usage:
def get_user_context(user_id: str, db: Session) -> str:
    memories = db.query(UserMemory).filter(UserMemory.user_id == user_id).all()
    if not memories:
        return ""
    return "\n".join(f"- {m.key}: {m.value}" for m in memories)
Injecting Long-Term Memory Into AgentState
# In run_agent():
user_context = get_user_context(user_id, db)
if user_context:
    # Add as SystemMessage at the start of context
    history.insert(0, SystemMessage(content=f"User context:\n{user_context}"))
Auto-Extracting Memories
A memory extractor node can automatically save new facts:
def extract_and_save_memories(state: AgentState, user_id: str, db: Session):
    """After each turn, check if any new user facts were revealed."""
    extraction_prompt = f"""Extract key facts about the user from this conversation.
Only extract concrete facts: preferences, team, role, projects, constraints.
Return as JSON: {{"key": "value"}} pairs, or empty dict if nothing new.
Conversation: {format_messages(state["messages"])}"""
    response = llm.invoke([HumanMessage(content=extraction_prompt)])
    try:
        facts = json.loads(response.content)
        for key, value in facts.items():
            upsert_user_memory(user_id, key, value, db)
    except json.JSONDecodeError:
        pass  # skip if LLM didn't return valid JSON
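The extractor calls upsert_user_memory, which is not shown. Below is a minimal sketch of one possible implementation, using the standard-library sqlite3 module rather than the project's SQLAlchemy session. The UNIQUE (user_id, key) constraint is an assumption (the UserMemory model does not declare one), and the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3

def upsert_user_memory(user_id: str, key: str, value: str, conn: sqlite3.Connection) -> None:
    """Insert a fact, or overwrite the value if (user_id, key) already exists."""
    conn.execute(
        "INSERT INTO user_memory (user_id, key, value) VALUES (?, ?, ?) "
        "ON CONFLICT(user_id, key) DO UPDATE SET "
        "value = excluded.value, updated_at = CURRENT_TIMESTAMP",
        (user_id, key, value),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_memory ("
    "id INTEGER PRIMARY KEY, user_id TEXT, key TEXT, value TEXT, "
    "updated_at TEXT DEFAULT CURRENT_TIMESTAMP, "
    "UNIQUE (user_id, key))"  # assumed constraint, required for ON CONFLICT
)
upsert_user_memory("u1", "team", "platform", conn)
upsert_user_memory("u1", "team", "data", conn)  # same key: updated in place
upsert_user_memory("u1", "role", "architect", conn)
facts = conn.execute(
    "SELECT key, value FROM user_memory WHERE user_id = ? ORDER BY key", ("u1",)
).fetchall()
```

Without a uniqueness constraint, repeated extractions of the same fact would pile up as duplicate rows instead of replacing the old value.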
6.5 L4 — Semantic Memory: ChromaDB / RAG
Your ChromaDB vector store at chroma_db/ is a form of external semantic memory for the agent — it stores domain knowledge that the agent can retrieve on demand.
# From rag/retrieve.py
def retrieve(query: str) -> list[str]:
    """Retrieve relevant chunks from ChromaDB."""
    collection = get_collection()
    results = collection.query(
        query_texts=[query],
        n_results=5,
        include=["documents", "distances"]
    )
    return results["documents"][0]  # list of relevant text chunks

def build_prompt(query: str, results: list[str]) -> str:
    """Format retrieved chunks as context for the LLM."""
    context = "\n---\n".join(results)
    return f"""Context from knowledge base:
{context}
Question: {query}"""
This is called by the rag_search tool:
@tool
def rag_search(query: str) -> str:
    """Search the knowledge base for Agent Factory information."""
    results = retrieve(query)
    return build_prompt(query, results)
The mental model: ChromaDB is the agent's long-term semantic memory. When the agent needs to know something about Agent Factory, it "recalls" related passages by similarity search, much as a person recalls relevant knowledge from memory.
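To make "recall by similarity" concrete, here is a toy retrieval sketch in pure Python. It is not how ChromaDB works internally: the bag-of-words embed function is a deliberately crude stand-in for a learned embedding model, and the sample chunks are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (real systems use a learned model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], n_results: int = 2) -> list[str]:
    """Return the n_results chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:n_results]

# Invented sample chunks standing in for ingested documentation
chunks = [
    "agents are deployed with the deploy command",
    "billing is calculated per active agent",
    "the office is closed on public holidays",
]
top = retrieve("how are agents deployed", chunks, n_results=1)
```

The query shares three words with the first chunk and none with the others, so that chunk ranks first. A learned embedding would also match paraphrases that share no words at all.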
6.6 LangGraph's Built-In Checkpointing
LangGraph has a built-in memory system via checkpointers — they automatically save the graph state at each step, enabling:
- Resumable conversations without database code
- Human-in-the-loop interruptions (Chapter 8)
- Fault tolerance (resume after crash)
Setup
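import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.checkpoint.postgres import PostgresSaver
# SQLite (development). In recent releases from_conn_string() is a context
# manager, so a long-lived saver is built from a connection instead.
checkpointer = SqliteSaver(sqlite3.connect("checkpoints.db", check_same_thread=False))
# PostgreSQL (production): use the saver inside the with-block
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    # Attach to graph at compile time; invoke the agent inside this block
    agent = graph.compile(checkpointer=checkpointer)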
Using Checkpointed Memory
# Each thread_id is a separate conversation
config = {"configurable": {"thread_id": session_id}}
# First call: starts a new thread
result1 = agent.invoke(
    {"messages": [HumanMessage("hello")]},
    config=config
)
# Second call: same thread_id, LangGraph auto-loads history!
result2 = agent.invoke(
    {"messages": [HumanMessage("who are you?")]},
    config=config
)
# No manual load_history() needed; the checkpointer handles it
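Conceptually, a checkpointer behaves like a store keyed by thread_id: each invoke loads the saved state, merges the new input, runs the graph, and saves the result. The toy class below illustrates only that idea. It is not LangGraph's implementation, ToyCheckpointedAgent is a hypothetical name, and it assumes the messages field appends new messages rather than overwriting the list.

```python
class ToyCheckpointedAgent:
    """Illustrative only: keeps one message list per thread_id in a dict."""

    def __init__(self, respond):
        self._respond = respond   # callable: list of messages -> reply text
        self._checkpoints = {}    # thread_id -> saved message list

    def invoke(self, inputs, config):
        thread_id = config["configurable"]["thread_id"]
        # Load the saved state for this thread (empty on first use)
        messages = list(self._checkpoints.get(thread_id, []))
        # Append the new input to the saved messages rather than replacing them
        messages.extend(inputs["messages"])
        messages.append(("ai", self._respond(messages)))
        # Save the updated state so the next call on this thread sees it
        self._checkpoints[thread_id] = messages
        return {"messages": messages}

agent = ToyCheckpointedAgent(lambda msgs: f"seen {len(msgs)} messages")
config = {"configurable": {"thread_id": "session-1"}}
result1 = agent.invoke({"messages": [("human", "hello")]}, config)
result2 = agent.invoke({"messages": [("human", "who are you?")]}, config)
result3 = agent.invoke(
    {"messages": [("human", "hello")]},
    {"configurable": {"thread_id": "session-2"}},
)
```

The caller passes only the new message each time. The accumulated history lives in the store, and a different thread_id starts from an empty state.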
Checkpointer vs. Your Custom memory.py
| Feature | LangGraph Checkpointer | Your memory.py |
|---|---|---|
| Code required | Minimal | Custom load/save functions |
| Storage format | Full serialized state | Human/AI message pairs |
| Works with HITL | Built-in | Manual implementation |
| Portability | LangGraph-specific | Portable to any system |
| Customization | Limited | Full control |
| Schema migrations | Auto | Manual |
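For most production use cases, switching to the PostgresSaver checkpointer eliminates the need for custom memory.py code. The main exception is when you need a human-readable history table that other systems can query directly.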
6.7 Migrating memory.py to LangGraph Checkpointer
Here's how to upgrade your project to use the built-in checkpointer:
# config.py: add checkpointer setup
from psycopg.rows import dict_row
from psycopg_pool import ConnectionPool
from langgraph.checkpoint.postgres import PostgresSaver
from database import DATABASE_URL
# In recent releases PostgresSaver.from_conn_string() is a context manager,
# so a module-level saver is built on a connection pool instead.
# DATABASE_URL must be a plain postgresql:// URI (no SQLAlchemy "+driver" suffix).
pool = ConnectionPool(
    conninfo=DATABASE_URL,
    kwargs={"autocommit": True, "prepare_threshold": 0, "row_factory": dict_row},
)
checkpointer = PostgresSaver(pool)
checkpointer.setup()  # creates checkpoint tables

# agent/agent.py: compile with checkpointer
from config import checkpointer
agent = graph.compile(checkpointer=checkpointer)

# main.py / run_agent(): simpler memory management
def run_agent(question: str, session_id: str) -> str:
    # No need for load_history() or save_history()!
    config = {"configurable": {"thread_id": session_id}}
    # Guardrails still apply
    if not check_input(question):
        return "I can't assist with that request."
    # Checkpointer auto-loads and saves conversation history
    result = agent.invoke(
        {"messages": [HumanMessage(content=question)]},
        config=config
    )
    return result["messages"][-1].content
6.8 Memory Windowing — Managing Context Costs
For long conversations, you need to limit context window usage:
Strategy 1 — Load Last N Turns
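def load_history(session_id: str, db: Session, max_turns: int = 10):
    query = db.query(ChatHistory).filter(ChatHistory.session_id == session_id)
    rows = query.order_by(ChatHistory.id.desc()).limit(max_turns).all()  # one row per turn
    rows.reverse()  # back to chronological order
    # ... convert rows to messages as in Section 6.2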
Strategy 2 — Summarize Old Turns
def summarize_history(messages: list[BaseMessage]) -> list[BaseMessage]:
    """Compress old messages into a summary SystemMessage."""
    if len(messages) <= 10:
        return messages  # no compression needed
    old_messages = messages[:-10]  # messages to summarize
    recent_messages = messages[-10:]  # keep last 10 verbatim
    summary_prompt = f"""Summarize this conversation:\n{format_messages(old_messages)}"""
    summary = llm.invoke([HumanMessage(content=summary_prompt)]).content
    return [
        SystemMessage(content=f"Previous conversation summary: {summary}"),
        *recent_messages
    ]
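Re-summarizing the whole backlog on every request repeats work. One variation is to keep a running summary and fold in only the messages that have aged out of the verbatim window since the last update. The sketch below assumes a summarize callable as a stand-in for the LLM call and a hypothetical RollingMemory container. In practice the summary and the cursor would be stored alongside the session.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RollingMemory:
    """Hypothetical container: a running summary plus the full message log."""
    summary: str = ""
    summarized_upto: int = 0  # messages[:summarized_upto] are already in the summary
    messages: list[str] = field(default_factory=list)

def build_context(
    memory: RollingMemory,
    summarize: Callable[[str, list[str]], str],
    keep_recent: int = 10,
) -> list[str]:
    """Fold newly aged-out messages into the summary, then return the context."""
    cutoff = max(len(memory.messages) - keep_recent, 0)
    if cutoff > memory.summarized_upto:
        aged_out = memory.messages[memory.summarized_upto:cutoff]
        # One summarizer call covering only the messages not yet summarized
        memory.summary = summarize(memory.summary, aged_out)
        memory.summarized_upto = cutoff
    recent = memory.messages[cutoff:]
    if memory.summary:
        return [f"Previous conversation summary: {memory.summary}", *recent]
    return recent

# Stand-in summarizer that records which messages it was asked to fold in
calls = []
def fake_summarize(previous: str, new_messages: list[str]) -> str:
    calls.append(list(new_messages))
    return (previous + " " + " ".join(new_messages)).strip()

memory = RollingMemory(messages=[f"m{i}" for i in range(1, 6)])
context_a = build_context(memory, fake_summarize, keep_recent=3)
memory.messages.append("m6")
context_b = build_context(memory, fake_summarize, keep_recent=3)
```

On the second call the summarizer receives only m3, the single message that left the verbatim window, not the whole backlog.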
Strategy 3 — Token Budget
import tiktoken

def trim_to_token_budget(messages: list[BaseMessage], budget: int = 8000):
    """Keep the most recent messages within a token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
    total_tokens = 0
    kept = []
    for msg in reversed(messages):
        tokens = len(enc.encode(str(msg.content)))
        if total_tokens + tokens > budget:
            break
        kept.insert(0, msg)
        total_tokens += tokens
    return kept
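A variation on the token-budget trim: the sketch below takes the token counter as a parameter, so it can be tried without tiktoken. It also always keeps a leading system message (for example, a summary or user context) even when older turns are dropped. Messages are modelled as (role, text) tuples, count_tokens is a hypothetical callable, and a whitespace word count stands in for a real tokenizer.

```python
from typing import Callable

Message = tuple[str, str]  # (role, text)

def trim_to_budget(
    messages: list[Message],
    budget: int,
    count_tokens: Callable[[str], int],
) -> list[Message]:
    """Keep a leading system message plus the newest messages that fit the budget."""
    head = messages[:1] if messages and messages[0][0] == "system" else []
    rest = messages[len(head):]
    used = sum(count_tokens(text) for _, text in head)
    kept_reversed = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(message[1])
        if used + cost > budget:
            break  # stop at the first message that does not fit
        kept_reversed.append(message)
        used += cost
    return head + kept_reversed[::-1]  # restore chronological order

def word_count(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words."""
    return len(text.split())

history = [
    ("system", "User prefers short answers"),    # 4 words
    ("human", "first question about pricing"),   # 4 words
    ("ai", "first answer"),                       # 2 words
    ("human", "second question"),                 # 2 words
    ("ai", "second answer here"),                 # 3 words
]
trimmed = trim_to_budget(history, budget=10, count_tokens=word_count)
```

With a budget of 10 words, the system message and the latest turn fit (4 + 2 + 3 = 9) and the first turn is dropped.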
6.9 Interview Q&A
Q: Explain the memory architecture in your LangGraph agent. What are the different layers?
Our agent has four memory layers. L1, in-context memory, is the AgentState.messages list during one graph invocation: the loaded history plus tool messages plus LLM responses. It exists only for the duration of the invocation. L2, session memory, is the ChatHistory table in PostgreSQL, accessed via load_history() and save_history() in memory.py. Messages from prior turns in the same session are loaded into L1 at the start of each invocation. L3 would be long-term cross-session memory (user preferences and similar facts); it is not yet implemented but would use a similar DB table queried by user_id. L4, semantic memory, is ChromaDB, the vector store holding Agent Factory documentation, retrieved on demand via the rag_search tool.
Q: Why did you implement custom memory.py instead of using LangGraph's built-in checkpointer?
The custom memory.py gives us a clean, human-readable chat_history table that's easy to query, audit, and display in a chat UI. The stored pairs (user_message, assistant_message) are directly usable by the REST API without deserialization. LangGraph's checkpointer stores the complete serialized state, which is harder to inspect or integrate with non-LangGraph clients. In a future iteration, PostgresSaver could replace the custom code, but the custom approach gives us full schema control right now.
Q: What happens if the session has 1000 turns? How do you handle context window limits?
We'd implement windowing in load_history. The simplest approach is loading only the last N turns (e.g., 10). A more sophisticated approach is summarization: compress older turns into a SystemMessage summary and keep recent turns verbatim. Token counting via tiktoken gives us the most precise control. In production, we'd also store a running summary in the database so we don't need to re-summarize on every request.
Q: How does RAG relate to agent memory?
RAG is the agent's semantic long-term memory: external knowledge that the agent can retrieve by similarity rather than exact lookup. Unlike session memory (time-ordered conversation), semantic memory allows cross-document retrieval: the agent can find relevant information across thousands of documents quickly. In our system, ChromaDB stores Agent Factory documentation. When the agent needs domain knowledge, it calls rag_search, which is conceptually the same as "recalling" relevant information from semantic memory. The difference between RAG and an LLM's parametric memory (training knowledge) is that RAG is dynamic: you can add new documents without retraining.
Q: How would you implement cross-session long-term memory?
I'd add a UserMemory table with user_id, key, and value columns. After each agent turn, a memory extraction step would call the LLM to identify new facts about the user (role, preferences, project context). These facts would be upserted into UserMemory. At the start of each session, get_user_context(user_id) would query these facts and inject them as a SystemMessage into the initial context. An optional embedding-based retrieval layer would scale this to thousands of memory items per user by finding the most relevant ones for each new question.
6.10 Key One-Liners to Memorize
"Four memory layers: in-context (RAM), session (DB), long-term (persistent), semantic (vector store)."
"load_history() → L2 → L1 at start; save_history() → L1 → L2 at end."
"The LLM sees AgentState.messages — put relevant history there before invoking."
"Window the history: load last N turns to control cost and latency."
"LangGraph checkpointer = auto-save state after every node, keyed by thread_id."
"RAG is the agent's semantic long-term memory — retrieve by meaning, not by lookup."
Next: Chapter 7 — Multi-Agent: Supervisor Pattern & Handoffs