Context Engineering: The New "Prompt Engineering" for AI Agents
In the early days of generative AI, the focus was on "prompt engineering"—the art of crafting the perfect query to get a brilliant answer. But as we move from simple chatbots to autonomous multi-agent systems, a new, more critical discipline has emerged: Context Engineering.
Context engineering, as Andrej Karpathy puts it, is the "delicate art and science of filling the context window with just the right information for the next step." It is no longer just about what you ask the model; it is about designing the entire information ecosystem—the memory, state, and tools—that surrounds the model.
Here is how to master context management to build reliable, production-grade AI systems.
The Problem: Context is a Finite, Dangerous Resource
While modern LLMs boast massive context windows (up to millions of tokens), treating the context window as a "dumping ground" is a recipe for failure. The cost of the attention mechanism scales quadratically with sequence length, so doubling the context roughly quadruples the attention compute.
Furthermore, poor context management leads to specific failure modes:
- Context Rot: As input tokens increase, the model's ability to recall information degrades, often losing critical data "in the middle" of the prompt.
- Context Poisoning: If an agent hallucinates or ingests incorrect data, that error is recorded in its memory and reinforced in subsequent turns, sending the agent down a nonsensical path.
- Context Distraction: Overloading the window with irrelevant tools or documents causes the model to over-focus on the accumulated context and neglect the reasoning abilities it learned during training.
- Context Clash: When contradictory pieces of information (e.g., conflicting flight times from two different API calls) exist in the same window, the model may hallucinate to reconcile them.
The Framework: Write, Select, Compress, Isolate
To prevent these failures, industry leaders (including Anthropic and Kubiya) have coalesced around a four-part strategy for managing context: Write, Select, Compress, and Isolate.
1. Write: Offload Memory
Do not rely on the context window for storage. Instead, write information to external systems, as sketched after this list.
- Scratchpads: Force agents to "think" on a scratchpad—an intermediate working memory where they record plans and calculations. This avoids regenerating the same reasoning steps.
- Long-Term Memory: Store user preferences and historical patterns in vector databases (like Elasticsearch or Pinecone) or SQL databases.
- Checkpoints: Use frameworks like LangGraph to save the state of an agent’s graph, allowing it to resume workflows after interruptions without holding everything in active memory.
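As a concrete illustration of the scratchpad idea, here is a minimal Python sketch. The `Scratchpad` class, the `scratchpad.json` path, and the note kinds are hypothetical stand-ins, not any framework's API; swap in whatever external store (file, database, LangGraph checkpointer) your stack actually uses.

```python
import json
from pathlib import Path

class Scratchpad:
    """Illustrative external working memory: notes live on disk, not in the prompt."""

    def __init__(self, path: str = "scratchpad.json"):
        self.path = Path(path)
        self.notes: list[dict] = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def write(self, kind: str, content: str) -> None:
        # Persist a plan step or intermediate result instead of keeping it in context.
        self.notes.append({"kind": kind, "content": content})
        self.path.write_text(json.dumps(self.notes, indent=2))

    def recall(self, kind: str) -> list[str]:
        # Pull back only the notes relevant to the next step.
        return [n["content"] for n in self.notes if n["kind"] == kind]

pad = Scratchpad()
pad.write("plan", "1) fetch flight data 2) compare prices 3) draft summary")
pad.write("result", "cheapest fare found so far")
# A later turn injects only the plan, not the full history, into the prompt:
print(pad.recall("plan"))
```

The point is that plans and intermediate results survive outside the prompt, so later turns can pull back only what they need.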
2. Select: Just-in-Time Retrieval
The goal is to retrieve only the tokens necessary for the current task; a sketch of dynamic tool selection follows this list.
- RAG (Retrieval-Augmented Generation): Rather than loading entire documents, retrieve only the specific chunks relevant to the query.
- Tool Loadout: Do not dump every available tool definition into the system prompt. Use semantic search to dynamically select the 3–5 tools relevant to the current user request.
- Dynamic In-Context Planning (DIP): Adjust the in-context cues and prompts on the fly based on evolving task demands.
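Here is a runnable sketch of the tool-loadout idea. The `TOOLS` registry is hypothetical, and `embed` is a deliberately naive bag-of-words stand-in kept only to make the example self-contained; in practice you would call a real embedding model.

```python
import math

def embed(text: str) -> dict[str, float]:
    # Toy stand-in for a real embedding model, used only to keep this runnable.
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical tool registry: name -> natural-language description.
TOOLS = {
    "search_flights": "find flights between two airports on a given date",
    "book_hotel": "reserve a hotel room in a city",
    "get_weather": "current weather forecast for a location",
    "convert_currency": "convert an amount between currencies",
    "send_email": "send an email to a recipient",
}

def select_tools(request: str, k: int = 3) -> list[str]:
    """Return only the k most relevant tool names for this request."""
    q = embed(request)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])), reverse=True)
    return ranked[:k]

# Only the top-k tool definitions get injected into the system prompt.
print(select_tools("Check the weather forecast for Lisbon"))
```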
3. Compress: Increase Information Density
When context gets too long, you must reduce volume without losing signal; a sketch of this pattern follows the list.
- Summarization: Periodically compress the last N turns of a conversation into a concise summary using an LLM. This preserves the "gist" while freeing up token space.
- Pruning/Trimming: Programmatically remove old messages, "chit-chat," or redundant tool outputs. For example, once a tool has returned a massive JSON blob, you might only keep the specific field the agent needed and discard the rest.
- Compaction: Anthropic suggests "compaction," where the model summarizes the message history to preserve architectural decisions and unresolved bugs while discarding redundant data.
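A sketch of the summarize-and-trim pattern, under two stated assumptions: `summarize` stands in for a real LLM call, and the token estimate and thresholds are illustrative rather than tuned values.

```python
MAX_TOKENS = 2000   # budget for conversation history
KEEP_RECENT = 4     # always keep the most recent turns verbatim

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use a real tokenizer in production

def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, ask an LLM to compress these turns into a short
    # summary that preserves decisions, constraints, and open questions.
    return f"SUMMARY OF {len(turns)} EARLIER TURNS: ..."

def compress_history(turns: list[str]) -> list[str]:
    total = sum(estimate_tokens(t) for t in turns)
    if total <= MAX_TOKENS or len(turns) <= KEEP_RECENT:
        return turns
    old, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    # Replace the old turns with one dense summary; recent turns stay intact.
    return [summarize(old)] + recent

history = [f"turn {i}: " + "lorem ipsum " * 20 for i in range(50)]
compressed = compress_history(history)
print(len(compressed), compressed[0][:40])  # 5 turns: one summary + 4 recent
```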
4. Isolate: The Sub-Agent Architecture
One of the most effective ways to manage context is to split the workload; see the sketch after this list.
- Orchestrator-Worker Pattern: A "lead" agent breaks a complex task into sub-tasks and assigns them to specialized "worker" agents.
- Context Hygiene: Each worker agent operates with its own clean context window. A "Research Agent" might scour the web with thousands of tokens, but it only returns a 500-token summary to the Lead Agent. This prevents "context pollution" where raw data from one task confuses the reasoning for another.
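The pattern reduces to a few lines of structure. In this sketch, `run_llm` is an unimplemented placeholder for your model provider; the point is that each worker starts from a clean context and hands back only a compressed digest.

```python
def run_llm(system: str, user: str) -> str:
    # Wire this to your model provider before calling lead_agent.
    raise NotImplementedError

def worker(task: str) -> str:
    # Fresh context window: only the sub-task, none of the lead agent's history.
    raw = run_llm(system="You are a research specialist.", user=task)
    # Compress before returning so raw data never pollutes the lead's context.
    return run_llm(system="Summarize in under 500 tokens.", user=raw)

def lead_agent(goal: str) -> str:
    subtasks = run_llm(
        system="Split the goal into independent sub-tasks, one per line.",
        user=goal,
    ).splitlines()
    digests = [worker(t) for t in subtasks]  # each runs in an isolated context
    return run_llm(
        system="Synthesize a final answer from these digests.",
        user="\n\n".join(digests),
    )
```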
Standardization: The Rise of MCP and A2A
As context management becomes more complex, standardized protocols are emerging to replace brittle, custom integrations.
- Model Context Protocol (MCP): Developed by Anthropic, this is described as the "USB-C for AI applications." It standardizes how AI agents connect to data sources (resources), execute actions (tools), and receive instructions (prompts). It lets an agent connect to a GitHub repo or Google Drive without custom integration code for every data source; a minimal server sketch follows this list.
- Agent-to-Agent (A2A) Protocol: While MCP connects agents to data, A2A connects agents to each other. It allows a client agent to outsource a task to a remote agent using a standardized "Agent Card" that advertises capabilities.
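To make the MCP side concrete, here is a minimal server sketch based on the official Python SDK's FastMCP helper (`pip install mcp`); the travel-themed tool and resource are invented for illustration, and you should check the current SDK docs since the API is still evolving.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("travel-assistant")

@mcp.tool()
def search_flights(origin: str, destination: str, date: str) -> str:
    """Find flights between two airports on a given date."""
    # Illustrative stub; a real server would query an airline API here.
    return f"Flights from {origin} to {destination} on {date}: ..."

@mcp.resource("itinerary://{trip_id}")
def get_itinerary(trip_id: str) -> str:
    """Expose a stored itinerary as a readable resource."""
    return f"Itinerary {trip_id}: details would be loaded from storage"

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport for local clients
```

Any MCP-compatible client can now discover and call `search_flights` without bespoke glue code, which is exactly the standardization the protocol is meant to provide.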
The Bottom Line: Context is a "Silent Tax"
Every token passed to a model is a recurring cost, both financially and in terms of latency. By shifting from "context stuffing" to disciplined Context Engineering, you move from building expensive, hallucination-prone demos to deploying scalable, cost-effective intelligent systems.
The future of AI isn't just about smarter models; it's about smarter context.