
Your AI Agents Are Not Reading Your Instructions

By Greg Arnold

Figure: prose vs. imperative format comparison

You wrote three paragraphs of constraints. Your agent read the first sentence and guessed the rest.

Not because it skipped the rest. Because that's how inference works. Language models are completion engines. When you write context-heavy prose about what you need, the model generates what statistically comes next after that kind of writing: a vague instruction, executed confidently.

Your careful exceptions got treated as color commentary. The constraint buried in paragraph three got summarized away. The one thing you needed left untouched? The model processed it, weighted it against every other token in the context window, and decided its own interpretation was probably right.

Spoiler: it wasn't.

You've done this yourself

Think about onboarding at a new job. Your manager handed you a 30-page process document and said, "Read this before you begin." And you skimmed it. You hit the highlights. You extracted what felt actionable and inferred the rest from context clues.

Nobody expects you to internalize every nuance in a binder before doing anything. The binder is reference material. The checklist is instruction. Your AI agents are doing exactly what you did: reading enough to start moving, filling gaps with reasonable inference, and proceeding with confidence that the output will be close enough.

"Close enough" in production is never close enough.

Prose allows inference. Imperatives don't.

The difference between these two instructions is not length:

Prose: "When making changes to the authentication system, you should be careful to avoid modifying any existing session handling logic, especially if it relates to the token refresh flow, since that part of the codebase is particularly sensitive and changes there can have downstream effects on the mobile clients."

Imperative: "DO NOT modify session handling or token refresh logic. Any change in /auth/session.py requires explicit approval."

Both say the same thing. One leaves room for interpretation. The other doesn't.

An agent reading the prose version has latitude to decide what counts as "modifying" session logic, what counts as "particularly sensitive," and what constitutes acceptable risk. That latitude is the problem.

Figure: prose instruction vs. imperative format, side by side before and after reformatting

The imperative version is a fence. You're either inside it or you're not. There is no inference step that gets you past it.

Agents are inference engines. Every piece of prose you write is an invitation to infer. An imperative removes the invitation.
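To make the fence concrete, here is a minimal Python sketch of how the imperative above could be checked mechanically. The function name `fence_violations`, the `PROTECTED` list, and every sample path other than `/auth/session.py` are assumptions made for illustration; this is not a description of any particular tooling.

```python
from fnmatch import fnmatch

# Hypothetical list of protected paths. Only /auth/session.py comes from the
# example instruction; add patterns such as "/auth/*" as needed.
PROTECTED = ["/auth/session.py"]


def fence_violations(changed_files, protected=PROTECTED):
    """Return the changed files that match a protected path pattern."""
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in protected)
    ]


# Hypothetical list of files touched by an agent's change.
changed = ["/auth/login.py", "/auth/session.py", "/ui/forms/signup.tsx"]
violations = fence_violations(changed)

if violations:
    print("Needs explicit approval:", ", ".join(violations))
else:
    print("No protected files touched.")
```

The prose version offers no equivalent check: "be careful" and "particularly sensitive" cannot be turned into a pass/fail test.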

What the numbers say

After 14 months building GeoScored entirely with AI agents — 1,009 tickets, 1,000+ merged PRs, 960K lines of code — we tracked rework cycles across every task.

Average rework cycles per task dropped from 1.8 to 0.6 after we overhauled how we structured briefs. That's a 67% reduction. Not from better models. Not from more expensive infrastructure. From changing the format of how we write instructions.
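The percentage follows directly from the two averages. As a quick check in Python (the variable names are illustrative):

```python
before = 1.8  # average rework cycles per task before the overhaul
after = 0.6   # average rework cycles per task after the overhaul

# Relative reduction: the drop divided by the starting value.
reduction = (before - after) / before
print(f"{reduction:.0%}")  # prints 67%
```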

Instruction format was the single biggest lever in that improvement. The academic literature is consistent with this: Zamfirescu-Pereira et al.'s CHI '23 study "Why Johnny Can't Prompt" found that non-experts design prompts opportunistically rather than systematically, and carry over expectations from instructing other people that do not hold for language models. That study is qualitative and does not measure error rates by format, but the habit it describes is the one we see in production, at scale, consistently: people default to writing for a human reader.

Figure: rework rate comparison, before vs. after the instruction format overhaul (67% reduction in rework cycles)

The brief got shorter when we made this switch. Shorter, and with more complete execution on the first pass. Less writing produced better results. That's the counterintuitive part.

The four ways prose fails

It's not random. Prose fails in predictable ways, and once you see the pattern, you can't unsee it.

Figure: the four ways prose instructions fail agents (summarization, inferred overrides, soft language, default behavior wins)

Summarization. Agents distill prose to its essence. If your constraint is embedded in a longer explanation, the distillation step extracts the explanation and drops the constraint. The agent now has the context without the guardrail.

Inferred overrides. Explaining WHY grants the agent latitude to deviate when it believes the "why" condition is satisfied. You wrote "don't modify the token refresh because the mobile clients depend on it." The agent looks at the diff, decides the mobile clients aren't affected in this case, and proceeds. You gave it the exit ramp.

Soft language. "You should probably" and "generally speaking" signal guidance, not rules. The agent calibrates accordingly. Hedging language in instructions is a direct communication that compliance is optional.

Default behavior wins. When prose produces ambiguity, the agent resolves conflicts using its training priors, not your intent. Wherever your instruction is unclear, the default behavior fills the gap. You think you covered it. You didn't.

Imperatives eliminate all four failure modes by design. There's nothing to summarize, no "why" to reason around, no hedging to calibrate against, no ambiguity to resolve.

Pull quote: Agents comply with imperatives. They skim narratives.

What to do about it

The fix is not complicated. It took one afternoon to rewrite our brief templates. The improvement started immediately.

Move all constraints to a CONSTRAINTS block. Every prohibited action goes there, formatted as "DO NOT [action]." No explanation inside the block. Explanation goes in a separate CONTEXT section the agent can reference but that doesn't compete with the instructions.

Move all requirements to a REQUIRED block. Positive obligations get their own section too. "Match component structure in /components/" not "try to keep things consistent with what's already there."

Kill the soft language. "Should," "probably," "generally," "ideally" — if it hedges a constraint, cut it. If you wouldn't accept that behavior, don't write the instruction as if you would.

Separate context from instruction. Agents need context to do good work. But context mixed into instruction blocks is the enemy. Put the why in CONTEXT. Put the what in CONSTRAINTS and REQUIRED. Keep them physically separate.
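One way to enforce these four rules is to build the brief from structured data instead of free text. The sketch below is a minimal Python illustration, not a description of an existing template system: the `Brief` class, its `lint` and `render` methods, and the choice to render CONTEXT last are all assumptions. The hedging words it flags are the ones listed above.

```python
import re
from dataclasses import dataclass, field

# Hedging words named in the text above; extend the pattern as needed.
SOFT_WORDS = re.compile(r"\b(should|probably|generally|ideally)\b", re.IGNORECASE)


@dataclass
class Brief:
    """A task brief whose instruction blocks are kept apart from context."""
    constraints: list[str] = field(default_factory=list)  # prohibitions
    required: list[str] = field(default_factory=list)     # positive obligations
    context: list[str] = field(default_factory=list)      # the "why"

    def lint(self) -> list[str]:
        """Return a list of problems found in the instruction blocks."""
        problems = []
        for line in self.constraints:
            if not line.startswith("DO NOT "):
                problems.append(f'constraint is not "DO NOT [action]": {line}')
        for line in self.constraints + self.required:
            match = SOFT_WORDS.search(line)
            if match:
                problems.append(f'soft language "{match.group(0)}": {line}')
        return problems

    def render(self) -> str:
        """Render non-empty blocks in a fixed order, with context last."""
        sections = [
            ("CONSTRAINTS", self.constraints),
            ("REQUIRED", self.required),
            ("CONTEXT", self.context),
        ]
        parts = []
        for title, lines in sections:
            if lines:
                body = "\n".join(f"- {line}" for line in lines)
                parts.append(f"{title}:\n{body}")
        return "\n\n".join(parts)


brief = Brief(
    constraints=["DO NOT modify session handling or token refresh logic"],
    required=["Match component structure in /components/"],
    context=["Mobile clients depend on the token refresh flow."],
)
draft = Brief(constraints=["You should probably avoid touching session logic"])

print(brief.render())
print(draft.lint())  # two problems: wrong form, and the hedge "should"
```

Because the explanation lives in its own field, it cannot leak into a constraint line, and a hedged draft is caught before it reaches the agent.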

A brief that used to read: "Make sure the UI doesn't break anything that's already working, and try to keep the component structure consistent with what's already in the codebase, especially around how we handle form states"

Now reads:

CONSTRAINTS:
- DO NOT modify existing form state logic
- DO NOT add new dependencies

REQUIRED:
- Match component structure in /components/forms/
- All existing tests must pass

Two blocks. Four lines. Zero ambiguity. The agent either followed them or it didn't, and you can verify that with a checklist instead of a code review.
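The checklist itself can be partly automated. The following Python sketch assumes a hypothetical `ChangeSummary` record describing what the agent's change did, and a hypothetical `FORM_STATE_PREFIX` for where form state logic lives; neither comes from a real codebase. The component-structure line is left out because it needs a structural linter or a human to judge.

```python
from dataclasses import dataclass

# Hypothetical location of form state logic; adjust to the real codebase.
FORM_STATE_PREFIX = "/state/forms/"


@dataclass
class ChangeSummary:
    """Hypothetical summary of what an agent's change did."""
    changed_files: list[str]
    deps_before: set[str]
    deps_after: set[str]
    tests_passed: bool


def run_checklist(change: ChangeSummary) -> dict[str, bool]:
    """Map each checkable line of the brief to True (followed) or False."""
    return {
        "form state logic untouched": not any(
            path.startswith(FORM_STATE_PREFIX) for path in change.changed_files
        ),
        "existing tests pass": change.tests_passed,
        # No new dependencies: the new set must be a subset of the old one.
        "no new dependencies": change.deps_after <= change.deps_before,
    }


# Hypothetical change that adds one dependency.
change = ChangeSummary(
    changed_files=["/components/forms/SignupForm.tsx"],
    deps_before={"react", "zod"},
    deps_after={"react", "zod", "lodash"},
    tests_passed=True,
)

results = run_checklist(change)
for item, ok in results.items():
    print("PASS" if ok else "FAIL", item)
```

Each line of the brief maps to a yes-or-no answer, which is what makes the verification a checklist and not a judgment call.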

The structural principle underneath all of this

This is the same principle that determines whether AI search engines can extract and cite your web content. GeoScored audits websites for AI extractability — the ability of AI systems to reliably understand and cite your pages. And the pattern we see on every audit: content that performs well in AI-generated answers is structured the same way effective agent instructions are. Clear claims up front. Self-contained passages. No inference required to understand what the content is saying.

AI systems are not readers. They are extractors. Whether they're extracting a recommendation from your website or an action from a task brief, the format determines what they can reliably get out of it.

Pull quote: The format of your instruction IS the instruction.

You cannot out-explain the inference step. You can only eliminate it.

Structure your instructions the same way you'd structure content you want an AI to cite accurately: lead with the claim, make every constraint explicit and self-contained, keep context separate from imperative. The agents reading your briefs and the AI systems reading your website are operating on the same underlying principle.

Format is the fix.


How well-structured is your content for AI extractability? Run a free AI Visibility Screening at geoscored.ai to see where your site stands — the same structural principles that make agent instructions reliable make web content citable.


Sources

  • GeoScored production data. 14 months, 1,009 tasks, 1,000+ merged PRs, 960K lines of code, 22 AI agent specialist profiles. Internal metrics tracked from GEO-1153 through GEO-1703 (2025-2026).

  • Aggarwal, P., Murahari, V., Rajpurohit, T., et al. "GEO: Generative Engine Optimization." arXiv:2311.09735. Princeton University, Georgia Tech, Allen Institute for AI, IIT Delhi. 2023, revised 2024. Reports visibility gains of up to 40% in generative engine responses from its content optimization methods. https://arxiv.org/abs/2311.09735

  • Gartner. "Top 10 Strategic Technology Trends for 2026." Context engineering identified as a breakout competency. Median salary $141K, 30-50% premium over generalist AI engineers. https://www.gartner.com/en/newsroom/press-releases/2025-10-20-gartner-identifies-top-10-strategic-technology-trends-for-2026

  • Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B., & Yang, Q. "Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts." CHI '23. Finds that non-expert users design prompts opportunistically rather than systematically and carry over expectations from human-to-human instruction. https://dl.acm.org/doi/10.1145/3544548.3581388