Prompt engineering is becoming context engineering

Two years ago the interesting skill was how to phrase an instruction. Today the interesting skill is what to put around the instruction. Prompt engineering is quietly becoming context engineering, and the job is much less like copywriting than most current guides still make out.

The shift isn't dramatic from the outside. The user still sees a text box. They still type something. The model still responds. What has changed is what happens behind the text box. A modern production deployment pulls in a retrieval step, a set of callable tools, a skill registry, the user's interaction history for this session and probably a summary of prior sessions, a system prompt that configures behaviour, and potentially a per-tenant configuration layer. The raw user instruction is five percent of the context. The other ninety-five percent is engineering.

The 70% rule

A heuristic that has emerged in practice, and that matches the numbers I've seen across multiple teams, is that model quality peaks when the context window is filled to roughly seventy percent of its capacity. Below forty percent, you're leaving performance on the table because the model doesn't have enough to work with. Above eighty percent, you start seeing latency rise faster than quality, and above ninety percent you get context collapse: the model gets confused about what's load-bearing and starts ignoring parts of its input.

Seventy percent is a surprisingly narrow target, and it is not the target most teams aim at. The default behaviour of any retrieval system is to stuff the context until the token budget is gone. Designing for seventy percent means leaving room for the user's input, for tool outputs during multi-turn interaction, and for the iteration history to accumulate without spilling the window. A retrieval step that returns exactly enough is more valuable than one that returns everything.
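To make "returns exactly enough" concrete, here is a minimal sketch of a retrieval step that packs chunks against a seventy percent target instead of the full window. The Chunk type, the pack_to_budget name, the relevance scores and the precomputed token counts are all assumptions for illustration, not any particular library's API, and the greedy strategy is only one of several reasonable ways to do this.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever (hypothetical)
    tokens: int   # token count, assumed to be precomputed


def pack_to_budget(chunks, window_tokens, reserved_tokens, target_percent=70):
    """Greedily pick the highest-scoring chunks that fit under the target fill.

    reserved_tokens is the headroom kept for the user's input, tool outputs
    and accumulating history, so retrieval never spends the whole target.
    """
    # Integer arithmetic avoids floating-point surprises in the budget.
    budget = window_tokens * target_percent // 100 - reserved_tokens
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        # Skip chunks that would overflow the budget; smaller ones may still fit.
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected, used
```

With a 1,000-token window and 100 tokens reserved, the retrieval budget is 600 tokens, so a 500-token chunk ranked second is skipped in favour of a smaller, lower-ranked chunk that still fits. Whether skipping or stopping at the first overflow is the better policy depends on how much you trust the ranking.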

The three layers

The structure that has shaken out in production deployments is three roughly separable layers: skills, tools, and history. The skill layer is the model's static knowledge base: prompts, retrieved documents, persona definitions, domain-specific patterns. The tool layer is the functional one: APIs the model can call, functions it can invoke, services it can query. The history layer is everything the model has seen in this interaction and possibly prior ones: user messages, tool outputs, intermediate reasoning traces.

Treating these three as separate engineering problems, with separate owners, works better than pretending they're one system. A content engineer owns the skill layer: what gets retrieved, how it's ranked, how stale it's allowed to get. A platform engineer owns the tool layer: what functions the model has access to, how they're secured, how their outputs are shaped. A product engineer owns the interaction history: what gets remembered between turns, what gets summarised, what gets dropped. Three different disciplines. Most teams conflate them into one 'prompt team' and then wonder why the outputs are uneven.
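One way to keep the three layers separately owned in code is to hold them as distinct fields and only merge them at the last moment. The sketch below assumes each layer has already been rendered to plain strings. The ContextLayers class, its assemble method and the bracketed section labels are hypothetical names chosen for illustration, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class ContextLayers:
    # Each layer is owned and populated by a different team.
    skills: list = field(default_factory=list)   # retrieved docs, persona, patterns
    tools: list = field(default_factory=list)    # descriptions of callable tools
    history: list = field(default_factory=list)  # prior turns, tool outputs

    def assemble(self, user_instruction):
        """Join the non-empty layers, in a fixed order, ahead of the instruction."""
        sections = [
            ("SKILLS", self.skills),
            ("TOOLS", self.tools),
            ("HISTORY", self.history),
            ("USER", [user_instruction]),
        ]
        parts = []
        for label, items in sections:
            if items:  # omit empty layers entirely
                parts.append(f"[{label}]\n" + "\n".join(items))
        return "\n\n".join(parts)
```

Keeping the layers apart until assembly means each owner can change ranking, tool exposure or summarisation policy without touching the other two, and an uneven output can be traced back to the layer that produced it.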

What this does to hiring

The hiring market has not caught up with the shift. 'Prompt engineer' as a job title still implies a copywriter with good intuition about language. The job, in a serious production deployment, is closer to an information architect. You are deciding what information gets to the model, in what order, with what emphasis, under which conditions. Writing the top-line instruction is a small part of that work, and the candidates who can write beautiful instructions without thinking about retrieval architecture are not the ones you want.

A practical reframe for the hiring board: ask candidates to describe the last context-management bug they debugged. If they can't produce a coherent answer, they've not been doing the engineering work. Ask what their retrieval strategy was for a specific product and why. Ask how they measured context utilisation. These questions separate candidates who have done the work from candidates who have read about it.

Auditability and compliance

The governance implication of this shift is that the model's output depends on what happened to be in its context, and the context is now dynamic, retrieved on the fly, varying per interaction. For a regulated domain, this is a non-trivial audit problem. Every answer the model gives is conditional on retrieval decisions that happened two hundred milliseconds before. If you want to reproduce the reasoning a month later, you need the retrieval state preserved, not just the prompt and response.

Most log systems are not designed for this. They log the prompt, the response, and nothing in between. The result is that incident reviews in context-engineered systems are frustrating exercises in guessing what the retrieval layer might have returned. Plan for this before you deploy. Retrofitting full-context logging later is painful and politically expensive.
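As a rough sketch of what full-context logging could look like, the function below builds one audit record per model call that keeps the retrieval state and tool calls alongside the prompt and response. The field names and the build_audit_record helper are assumptions for illustration. Where the record gets stored, and for how long, is a separate decision this sketch does not make.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(prompt, response, retrieved_chunks, tool_calls):
    """Capture everything needed to reconstruct the context later.

    retrieved_chunks and tool_calls are assumed to be JSON-serialisable
    lists, for example dicts with a chunk id, source and text.
    """
    # Hash a canonical serialisation so later tampering or drift is detectable.
    context_blob = json.dumps(
        {"retrieved": retrieved_chunks, "tool_calls": tool_calls},
        sort_keys=True,
    )
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "retrieved": retrieved_chunks,
        "tool_calls": tool_calls,
        "context_sha256": hashlib.sha256(context_blob.encode("utf-8")).hexdigest(),
    }
```

Storing the retrieved chunks themselves, not just their ids, matters if the underlying documents can change between the incident and the review.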

What to measure

The metrics worth tracking, in rough order of usefulness: retrieval precision at position one (was the top-ranked chunk the one that answered the question), context utilisation (tokens passed as a fraction of window capacity, with seventy percent as the target), tool-call success rate, and end-to-end task completion. The first two are the most diagnostic. If retrieval precision at one is poor, everything downstream suffers regardless of prompt quality. If utilisation is pinned at ninety-five percent, you're one context-collapse incident away from an embarrassing outage.
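The two most diagnostic metrics are simple to compute once the data is logged. This is a minimal sketch, assuming each evaluation query comes with a ranked list of chunk ids and a hand-labelled set of relevant ids. The function names and input shapes are illustrative, not taken from any evaluation library.

```python
def precision_at_one(results):
    """Fraction of queries whose top-ranked chunk is labelled relevant.

    results is a list of (ranked_chunk_ids, relevant_ids) pairs.
    """
    if not results:
        return 0.0
    hits = sum(
        1 for ranked, relevant in results
        if ranked and ranked[0] in relevant  # empty rankings count as misses
    )
    return hits / len(results)


def context_utilisation(tokens_passed, window_tokens):
    """Tokens sent to the model as a fraction of window capacity."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return tokens_passed / window_tokens
```

Tracking utilisation per request, not as a daily average, is what surfaces the calls that drift toward the top of the window.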

The engineering craft for the next two years is measuring this stack and tuning it. The craft is less about writing the perfect instruction and more about designing the information environment around the instruction. The title will catch up with the work eventually. The work is already different.
