Model Watch

Context engineering: feeding the agent more made it worse

We held one model fixed and changed only how the agent was built around it, and the same model scored anywhere from 67 to 95 percent. The biggest jump came from a setup that gave the agent a short plan and had it check its own output before finishing, and that setup cost less to run than the do-nothing baseline.

PCTX Editorial · Jun 5, 2026 · 4 min

A working definition of context engineering

Context engineering is the practice of managing everything a model sees while it works. That covers the instructions you give it, the tools you expose, the data you pull in, and the partial results the agent produces as it goes. Wording the prompt is one piece of it. The larger job is deciding what belongs in the context window at each step, and what to keep out.
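One way to picture that definition is as a small data structure. The sketch below is illustrative only: the names `ContextItem` and `build_context`, the priority field, and the character budget are assumptions made for this example, not part of AVP or of any vendor's API.

```python
from dataclasses import dataclass

# Hypothetical labels for the four kinds of material named in the definition.
KINDS = ("instructions", "tools", "data", "partial_results")


@dataclass
class ContextItem:
    kind: str      # one of KINDS
    text: str      # what the model would actually see
    priority: int  # lower number = more important (an assumption of this sketch)


def build_context(items: list[ContextItem], budget_chars: int) -> list[ContextItem]:
    """Walk items in priority order, skipping any that would overflow the budget.

    Whatever is skipped stays out of the window, which is the
    "what to keep out" half of the job.
    """
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        if item.kind not in KINDS:
            raise ValueError(f"unknown kind: {item.kind}")
        if used + len(item.text) <= budget_chars:
            chosen.append(item)
            used += len(item.text)
    return chosen
```

A real agent would more likely budget in tokens than characters and re-run this selection at every step, since the partial results change as the agent works.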

A definition does not tell you how much any of this changes the result, so we ran the experiment.

The spread was 28 points, from 67 to 95 percent, and it came only from the context and build around a single fixed model. The most accurate setup cost less per run than the baseline, and neither of the two setups that loaded the most into the context beat the plain prompt.

How we ran it

The runs come from our open benchmark for AI agents, the Agent Voyager Project (AVP). Every setup used the same cheap frontier model, Claude Haiku 4.5, on the same task: reading a dense PDF page and rebuilding it as a structured HTML table, across ten pages per setup. The model never changed between runs. Only the build around it did.

Setup	Accuracy	Cost/run	Pass rate
Plan + a self-check step	95%	$0.33	10/10
Plain prompt (baseline)	82%	$0.35	9/10
Plain prompt + packaged Skill	81%	$0.31	8/10
Plain prompt + external tool	70%	$0.20	9/10
Terser prompt	68%	$0.59	7/10
Worked example (few-shot)	67%	$0.82	8/10

The full run is in Captain's Log #1.
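The comparisons drawn from the table can be reproduced with a few lines of arithmetic. This is a minimal sketch that copies the accuracy and cost figures from the table above; the function names are ours and chosen only for illustration.

```python
# Figures copied from the results table: (setup, accuracy %, cost per run in USD).
RESULTS = [
    ("Plan + a self-check step", 95, 0.33),
    ("Plain prompt (baseline)", 82, 0.35),
    ("Plain prompt + packaged Skill", 81, 0.31),
    ("Plain prompt + external tool", 70, 0.20),
    ("Terser prompt", 68, 0.59),
    ("Worked example (few-shot)", 67, 0.82),
]

BASELINE = "Plain prompt (baseline)"


def spread(results):
    """Gap in accuracy points between the best and worst setup."""
    accuracies = [acc for _, acc, _ in results]
    return max(accuracies) - min(accuracies)


def delta_vs_baseline(results, baseline=BASELINE):
    """Map each non-baseline setup to (accuracy change, cost change) against the baseline."""
    base_acc, base_cost = next((a, c) for n, a, c in results if n == baseline)
    return {
        name: (acc - base_acc, round(cost - base_cost, 2))
        for name, acc, cost in results
        if name != baseline
    }


if __name__ == "__main__":
    print(f"spread: {spread(RESULTS)} points")
    for name, (d_acc, d_cost) in delta_vs_baseline(RESULTS).items():
        print(f"{name}: {d_acc:+d} points, {d_cost:+.2f} USD per run")
```

Read against the baseline, only the plan plus self-check setup gains accuracy, and it does so while costing two cents less per run.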

More context made the agent worse

Neither of the two builds that loaded the most into the context beat the plain prompt. The worked example, a few-shot setup that drops a solved case into the prompt, came last at 67 percent and cost more than any other run. The packaged Skill, a prebuilt bundle of instructions and helpers, landed a point below the plain prompt that did nothing special.

Both of those builds add material to the context, and neither came out ahead. The worked example gave up 15 points and cost more than twice as much per run ($0.82 against $0.35). The Skill ran slightly cheaper, at $0.31, but still scored a point lower and passed one fewer page.

The same pattern shows up outside our benchmark. A study of cross-component interference in agent scaffolding found that loading in every available component degrades performance, and that a trimmed-down subset beats the fully equipped agent (More Is Not Always Better).

A plan and a self-check beat every upgrade

The build that scored highest, at 95 percent, did the opposite of piling on. It gave the agent a short plan to follow and one closing instruction, to check its own work before handing it back.

Because the agent stopped turning in output it would only have to redo, that build also cost less to run than the plain prompt it beat. The change that helped most was also the most ordinary one we made, a plan plus a self-check.

This is the part of context engineering that pays, and it has little to do with volume. The winning build kept only a few of the right things in front of the model, a clear task, the source it needed to read, and a moment near the end to reread its own output.
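A minimal sketch of what such a build might look like follows. The plan wording, the self-check wording, and the names `build_prompt` and `run_agent` are hypothetical: this article does not publish the exact prompts used in AVP, and `call_model` is a stand-in for whatever model client you use rather than a real library call.

```python
from typing import Callable

# Hypothetical wording, written for this example only.
PLAN = (
    "1. Read the page and identify every table row and column.\n"
    "2. Rebuild the table as structured HTML."
)
SELF_CHECK = (
    "Before you answer, reread your output against the source "
    "and fix any mismatch."
)


def build_prompt(task: str, source: str) -> str:
    """Assemble a clear task, a short plan, the source, and a closing self-check."""
    return "\n\n".join([
        f"Task: {task}",
        f"Plan:\n{PLAN}",
        f"Source:\n{source}",
        SELF_CHECK,
    ])


def run_agent(task: str, source: str, call_model: Callable[[str], str]) -> str:
    """Send the assembled prompt through `call_model` and return its reply."""
    return call_model(build_prompt(task, source))
```

For a dry run without any model, `run_agent("Rebuild the page as an HTML table", page_text, lambda p: p)` simply returns the assembled prompt so you can inspect what the model would see. Nothing else goes into the prompt: no worked example, no bundled helpers.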

Context engineering versus prompt engineering

These are not the same thing, and the run shows how far apart they sit. Prompt engineering is the wording of the instruction. On its own it did not help here: the terser, more carefully worded prompt scored 68 percent, 14 points below the plain prompt and one point above the worst run.

Context engineering is the wider set of choices about what the model sees and does, and almost the whole 28-point difference came from there. Prompt wording is one tool inside it, and on this task it was close to the weakest one we had.

Where to put your effort

If you run agents and you are deciding where the next hour of work goes, context engineering beats prompt tweaking, and short of the frontier it beats shopping for a bigger model. Ethan Mollick made the general version of this point in February, writing that for most people the gaps between models have shrunk to the point where the app and harness around them matter more.

Our run puts a number on it. On one fixed model, the context around it was worth 28 points, and the change that returned the most was close to free to make and cost less per run than the baseline: giving the agent a plan and telling it to check its own work.

Common questions

What is context engineering?

Context engineering is the practice of managing everything a model sees while it works, from the instructions and the tools you expose to the data you pull in and the partial results the agent produces. Wording the prompt is one part of it. The larger job is deciding what belongs in the context window at each step.

Is context engineering the same as prompt engineering?

No. Prompt engineering is the wording of the instruction. Context engineering is the wider set of choices about what the model sees and does. In our benchmark, tightening the prompt wording left the agent 14 points below the plain prompt, while the broader context choices spanned 28 points from worst to best.

What is the highest-impact context-engineering change?

In our open agent benchmark, the largest gain came from a near-free change, giving the agent a short plan and a step to check its own work before returning it. That setup scored 95 percent and cost less than the do-nothing baseline.

Does adding more context make an agent more accurate?

Not on its own. Neither of the two setups that loaded the most into the context beat the baseline: the worked example finished last, and the packaged Skill landed a point below the plain prompt. Structure and a self-check beat sheer volume.