Your AI Agent Forgets Everything After 50 Messages. Here's the Fix.
Run a Pydantic AI agent for 50 messages and ask it about something from message #3. It won’t remember. Not because the model is bad - because the conversation got too long, the context window filled up, and the early messages fell off.
This isn’t a model problem. It’s an infrastructure problem. And it gets worse with agents that use tools, because every tool call and response is a message. A 20-turn conversation with 3 tool calls per turn is actually 80+ messages. At ~4 characters per token, that’s tens of thousands of tokens consumed before you know it.
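The arithmetic above is easy to sanity-check. A quick sketch, assuming each turn bundles its 3 tool calls into one assistant response and their returns into one request (the per-turn message shape and average message size here are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope check of the claim above. Assumed message shape per
# turn: one user request, one assistant response carrying the 3 tool calls,
# one request with the 3 tool returns, and one final assistant response.
turns = 20
messages_per_turn = 4
total_messages = turns * messages_per_turn

avg_chars_per_message = 2000   # assumed average; tool returns are often large
chars_per_token = 4            # the ~4 chars/token heuristic
approx_tokens = total_messages * avg_chars_per_message // chars_per_token

print(total_messages)  # 80
print(approx_tokens)   # 40000
```

If tool calls land in separate responses instead of one, the message count doubles - the token bill only climbs from there.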
TL;DR
- Context overflow is an infrastructure problem, not a model problem. Every long-running agent needs a context management strategy.
- SummarizationProcessor uses an LLM to intelligently compress old messages into a summary - higher quality, but costs an API call per trigger.
- SlidingWindowProcessor simply trims old messages for zero cost - fast but loses context entirely.
- Tool call pair preservation is critical. Both processors ensure tool calls and their responses are never split apart.
- Fraction-based triggers (`("fraction", 0.8)`) are the most portable option - they adapt automatically to any model’s context window.
We hit this wall in every long-running agent deployment. Customer support bots that forget the customer’s name. Code assistants that lose track of which files they’ve already modified. Research agents that re-investigate topics they covered 30 messages ago.
So we built summarization-pydantic-ai - two processors that manage your agent’s conversation history: one that summarizes intelligently using an LLM, and one that simply trims old messages for zero cost.
Two Processors, Two Trade-offs
| Aspect | SummarizationProcessor | SlidingWindowProcessor |
|---|---|---|
| Cost | LLM API call per trigger | Zero |
| Latency | Depends on model | ~0ms |
| Context loss | Minimal (intelligently summarized) | Complete (old messages gone) |
| Default trigger | 170,000 tokens | 100 messages |
| Default keep | 20 messages | 50 messages |
| Best for | Quality-critical agents | Speed/cost-critical agents |
Both work as Pydantic AI history processors - drop-in functions that transform the message history before each agent run.
SummarizationProcessor: Intelligent Compression
```python
from pydantic_ai import Agent
from pydantic_ai_summarization import create_summarization_processor

processor = create_summarization_processor(
    model="openai:gpt-4.1",
    trigger=("tokens", 100000),  # Trigger at 100k tokens
    keep=("messages", 20),       # Keep last 20 messages
)

agent = Agent(
    "openai:gpt-4.1",
    history_processors=[processor],
)
```

When the token count exceeds the trigger, the processor:
- Calculates a safe cutoff point (never splitting tool call pairs)
- Sends the older messages to an LLM for summarization
- Replaces the old messages with a compact summary
- Keeps the last 20 messages intact
The result: your agent maintains context from early in the conversation without consuming the full token budget.
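The four steps above can be sketched as a plain-Python function. This is a simplified standalone illustration, not the library's internals - `summarize` stands in for the LLM call, and real messages are Pydantic AI message objects, not strings:

```python
def compress_history(messages, keep_last=20, summarize=None):
    """Replace everything but the last `keep_last` messages with a summary."""
    if len(messages) <= keep_last:
        return messages                       # under budget: nothing to do
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary_fn = summarize or (lambda ms: f"{len(ms)} earlier messages elided")
    summary = summary_fn(old)                 # one LLM call in the real processor
    return [f"Summary of earlier conversation: {summary}"] + recent

history = [f"msg {i}" for i in range(30)]
compressed = compress_history(history)
print(len(compressed))  # 21: one summary message plus the last 20
```

The real processor adds the safe-cutoff check described below before choosing where `old` ends.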
Three Trigger Types
You can trigger summarization based on messages, tokens, or context fraction:
Message-based - simple and predictable:
```python
processor = SummarizationProcessor(
    model="openai:gpt-4.1",
    trigger=("messages", 50),  # After 50 messages
)
```

Token-based - accounts for message length:

```python
processor = SummarizationProcessor(
    model="openai:gpt-4.1",
    trigger=("tokens", 100000),  # After 100k tokens
)
```

Fraction-based - adapts to any model’s context:

```python
processor = SummarizationProcessor(
    model="openai:gpt-4.1",
    trigger=("fraction", 0.8),  # At 80% of context
    max_input_tokens=128000,    # The model's context window
)
```

Multiple triggers - OR logic, first one wins:

```python
processor = SummarizationProcessor(
    model="openai:gpt-4.1",
    trigger=[
        ("messages", 50),    # OR
        ("tokens", 100000),  # OR
        ("fraction", 0.8),
    ],
    max_input_tokens=128000,
)
```

Fraction-based triggers are the most robust for production - they adapt automatically when you switch between models with different context windows.
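The OR semantics can be made concrete with a small sketch. This illustrates the described behavior only - the function name and signature here are hypothetical, not the library's API:

```python
def should_summarize(n_messages, n_tokens, triggers, max_input_tokens=None):
    """Return True as soon as any trigger condition is met (OR logic)."""
    for kind, value in triggers:
        if kind == "messages" and n_messages >= value:
            return True
        if kind == "tokens" and n_tokens >= value:
            return True
        if (kind == "fraction" and max_input_tokens is not None
                and n_tokens >= value * max_input_tokens):
            return True
    return False

triggers = [("messages", 50), ("tokens", 100000), ("fraction", 0.8)]
print(should_summarize(60, 5_000, triggers, max_input_tokens=128_000))   # True (message count)
print(should_summarize(10, 50_000, triggers, max_input_tokens=128_000))  # False (under all three)
```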
Tool Call Pair Preservation
This is the detail most context management solutions get wrong. Consider this message sequence:
```
User: "Search for Python tutorials"
Assistant: [tool_call: search("Python tutorials"), id=call_1]
Tool: [tool_return: "Found 5 results...", id=call_1]
Assistant: "Here are the top results..."
User: "Tell me more about the first one"
```

If you cut between the tool call and its return, the model sees an orphaned tool call - and breaks. Our processors handle this:
```python
def _is_safe_cutoff_point(self, messages, cutoff_index):
    """Check if cutting at index would separate tool call/response pairs."""
    search_start = max(0, cutoff_index - 5)
    search_end = min(len(messages), cutoff_index + 5)

    for i in range(search_start, search_end):
        msg = messages[i]
        if not isinstance(msg, ModelResponse):
            continue

        tool_call_ids = set()
        for part in msg.parts:
            if isinstance(part, ToolCallPart) and part.tool_call_id:
                tool_call_ids.add(part.tool_call_id)

        if not tool_call_ids:
            continue

        # Check if cutoff separates this tool call from its response
        for j in range(i + 1, len(messages)):
            check_msg = messages[j]
            if isinstance(check_msg, ModelRequest):
                for part in check_msg.parts:
                    if (isinstance(part, ToolReturnPart)
                            and part.tool_call_id in tool_call_ids):
                        tool_before = i < cutoff_index
                        response_before = j < cutoff_index
                        if tool_before != response_before:
                            return False  # Unsafe - would split pair
    return True
```

The processor searches 5 messages in either direction from the cutoff point, finds all tool call/return pairs, and ensures they stay on the same side of the cut.
SlidingWindowProcessor: Zero-Cost Trimming
When you don’t need intelligent summarization - when speed and cost matter more than context quality:
```python
from pydantic_ai import Agent
from pydantic_ai_summarization import create_sliding_window_processor

processor = create_sliding_window_processor(
    trigger=("messages", 100),  # Trim at 100 messages
    keep=("messages", 50),      # Keep last 50
)

agent = Agent(
    "openai:gpt-4.1",
    history_processors=[processor],
)
```

No LLM call. No latency. Old messages are simply discarded. The same tool call pair preservation applies - the sliding window won’t cut in the middle of a tool call sequence.
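For intuition, the window logic itself reduces to a few lines. This sketch shows only the trimming behavior - the real processor additionally runs the safe-cutoff check before cutting:

```python
def sliding_window(messages, trigger=100, keep=50):
    """Drop everything but the last `keep` messages once `trigger` is hit."""
    if len(messages) < trigger:
        return messages       # under the trigger: leave history untouched
    return messages[-keep:]   # real processor also avoids splitting tool pairs

history = [f"msg {i}" for i in range(120)]
trimmed = sliding_window(history)
print(len(trimmed))  # 50
print(trimmed[0])    # msg 70
```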
Token Counting: Approximate by Default
The default token counter uses a simple heuristic - ~4 characters per token:
```python
def count_tokens_approximately(messages):
    total_chars = 0
    for msg in messages:
        if isinstance(msg, ModelRequest):
            for part in msg.parts:
                if isinstance(part, UserPromptPart):
                    if isinstance(part.content, str):
                        total_chars += len(part.content)
                elif isinstance(part, SystemPromptPart):
                    total_chars += len(part.content)
                elif isinstance(part, ToolReturnPart):
                    total_chars += len(str(part.content))
        elif isinstance(msg, ModelResponse):
            for part in msg.parts:
                if isinstance(part, TextPart):
                    total_chars += len(part.content)
                elif isinstance(part, ToolCallPart):
                    total_chars += len(part.tool_name)
                    total_chars += len(str(part.args))
    return total_chars // 4
```

This is fast and good enough for most cases. For precision, plug in tiktoken:
```python
import tiktoken

def accurate_counter(messages):
    encoding = tiktoken.encoding_for_model("gpt-4")
    total = 0
    for msg in messages:
        total += len(encoding.encode(str(msg)))
    return total

processor = create_summarization_processor(
    token_counter=accurate_counter,
    trigger=("tokens", 100000),
)
```

Custom Summary Prompts
The default summary prompt extracts key context. You can customize it:
```python
processor = create_summarization_processor(
    summary_prompt="""
    You are summarizing an agent conversation. Extract:

    1. **Key Decisions**: What was decided?
    2. **Code Changes**: What code was written/modified?
    3. **Pending Tasks**: What still needs to be done?
    4. **Important Context**: What context is crucial to preserve?

    Conversation to summarize:
    {messages}

    Provide a concise summary that preserves essential information.
    """,
)
```

The {messages} placeholder is replaced with the formatted message history. Customize this for your domain - a customer support agent might extract customer name and issue, while a code assistant might focus on file changes and test results.
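Under the hood, this kind of placeholder is ordinary string substitution. A quick sketch of how `{messages}` gets filled (the transcript and the exact formatting are illustrative assumptions - the library formats messages its own way):

```python
summary_prompt = "Summarize this conversation:\n{messages}\nBe concise."

# Hypothetical transcript, flattened to "role: text" lines:
transcript = [("user", "Reset my password"), ("assistant", "Sent a reset link.")]
formatted = "\n".join(f"{role}: {text}" for role, text in transcript)

prompt = summary_prompt.format(messages=formatted)
print(prompt)
```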
Conversation Loop Pattern
Here’s the typical production pattern:
```python
from pydantic_ai import Agent
from pydantic_ai_summarization import create_summarization_processor

processor = create_summarization_processor(
    trigger=("messages", 20),
    keep=("messages", 5),
)

agent = Agent(
    "openai:gpt-4.1",
    history_processors=[processor],
)

async def chat():
    message_history = []

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        result = await agent.run(
            user_input,
            message_history=message_history,
        )

        print(f"Assistant: {result.output}")
        message_history = result.all_messages()
```

The history processor runs automatically before each agent.run() call. You pass the full message history, the processor checks if summarization is needed, and returns a (possibly compressed) history. No manual token counting required.
Key Takeaways
- Context overflow is an infrastructure problem, not a model problem. Every long-running agent needs a context management strategy. Choose between intelligent summarization (higher quality, LLM cost) and sliding window (zero cost, context loss).
- Tool call pair preservation is critical. Never cut between a tool call and its response. Both our processors handle this automatically - search 5 messages in each direction for orphaned pairs.
- Fraction-based triggers are the most portable. `("fraction", 0.8)` works regardless of which model you use or how long your messages are. Token and message counts are model- and content-specific.
- The ~4 chars/token heuristic is good enough. For production precision, use `tiktoken`. For everything else, the default counter keeps things fast without adding a dependency.
- Customize the summary prompt for your domain. The default extracts generic context. A domain-specific prompt ("extract customer name, issue ID, and resolution status") produces much better summaries.
Try It Yourself
summarization-pydantic-ai - Automatic conversation summarization and context management for Pydantic AI agents.
```shell
pip install summarization-pydantic-ai
```