Models are radically literal. If you want something, you must explicitly ask for it. This isn't a limitation, it's a feature that makes behaviour predictable.
Ambiguity is the enemy of performance. Every piece of vagueness in your prompt is a dice roll on the output. The model shouldn't have to guess anything: not length, tone, format, or purpose.
Positive instructions outperform negative constraints. Telling a model what to do works better than telling it what not to do. "Write in plain prose" beats "Don't use markdown."
Structure trumps sophistication. A well-organised prompt with clear sections outperforms a clever prompt every time. The era of "prompt hacks" is over.
Examples are more powerful than explanations. Showing the model what you want through few-shot examples is more reliable than describing it, no matter how detailed your description.
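A minimal sketch of what that looks like in practice, with illustrative tickets and labels: two worked examples teach both the label set and the answer format, and the new case simply continues the pattern.

```python
# A few-shot classification prompt. The tickets and labels are illustrative;
# the point is that the examples teach both the label set and the format.
prompt = """Classify each support ticket as BILLING, BUG, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The export button does nothing when I click it."
Label: BUG

Ticket: "Can you recommend a good keyboard?"
Label: OTHER

Ticket: "My invoice shows the wrong company name."
Label:"""
```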
Order matters more than we thought. A consistent sequence helps: Role → Context → Task → Input Data → Output Format → Constraints.
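A sketch of that ordering as a reusable template; the field names and sample values are illustrative, not prescriptive.

```python
# An illustrative template following Role -> Context -> Task -> Input Data ->
# Output Format -> Constraints. Field names and sample values are examples only.
def build_prompt(role, context, task, input_data, output_format, constraints):
    return "\n\n".join([
        f"Role: {role}",
        f"Context: {context}",
        f"Task: {task}",
        f"Input data:\n{input_data}",
        f"Output format: {output_format}",
        f"Constraints: {constraints}",
    ])

prompt = build_prompt(
    role="You are a financial analyst.",
    context="The reader is a non-specialist board member.",
    task="Summarise the quarterly results below in plain language.",
    input_data="Revenue: 4.2M (up 8%). Costs: 3.1M (up 2%).",
    output_format="Three short bullet points.",
    constraints="Do not speculate beyond the figures provided.",
)
```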
Context isn't optional, it's foundational. Without adequate context, models default to generic responses. Context doesn't just inform, it fundamentally shapes the response space.
Format specification prevents drift. Explicitly defining output structure isn't pedantic; it's the difference between usable and unusable outputs in production systems.
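For instance, a sketch of an extraction prompt that pins the output to a parseable shape; the JSON schema shown is purely illustrative.

```python
# A sketch of pinning the output to a parseable shape. The JSON schema in
# the prompt is illustrative; downstream code can now json.loads the reply.
notes = "..."  # the raw meeting notes go here
prompt = (
    "Extract the action items from the meeting notes below.\n\n"
    "Return ONLY valid JSON with this exact shape:\n"
    '{"action_items": [{"owner": "<name>", "task": "<one sentence>", '
    '"due": "<YYYY-MM-DD or null>"}]}\n\n'
    f"Meeting notes:\n{notes}"
)
```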
Delimiters create cognitive boundaries. Using XML tags, triple quotes, or markdown headers isn't just organisation; it helps models understand where one type of information ends and another begins.
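A small sketch of delimiter use with XML-style tags; the tag names are illustrative, and contract_text is a placeholder for your own document.

```python
# XML-style tags mark where one kind of information ends and another begins.
# The tag names are illustrative; contract_text stands in for your own document.
contract_text = "..."
prompt = f"""<role>
You are a contract reviewer.
</role>

<document>
{contract_text}
</document>

<task>
List every clause in the document above that mentions termination,
quoting each clause verbatim inside <quote> tags.
</task>"""
```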
"Let's think step by step" remains magical with simple models. This simple phrase can improve accuracy by 40% on complex tasks. It's not a trick it fundamentally changes how the model processes the problem.
Planning before execution prevents errors. Separating "devise a plan" from "execute the plan" catches missing steps that sequential reasoning often skips.
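One way to wire that up is as two separate calls, sketched below; ask_model is a hypothetical helper standing in for whatever client you use.

```python
# Plan first, execute second, as two separate calls. `ask_model` is a
# hypothetical helper standing in for whatever client you use.
def plan_then_execute(task):
    plan = ask_model(
        f"Task: {task}\nDevise a numbered plan. Do not carry out any step yet.")
    return ask_model(
        f"Task: {task}\nPlan:\n{plan}\n"
        "Now execute the plan step by step, labelling each step.")
```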
Multiple reasoning paths beat single threads. Running several reasoning chains and taking the majority answer reduces random errors significantly.
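This is the idea behind self-consistency. A sketch, again with hypothetical ask_model and extract_answer helpers.

```python
# Self-consistency: sample several reasoning chains at a non-zero
# temperature, extract each final answer, and take the majority vote.
# `ask_model` and `extract_answer` are hypothetical helpers.
from collections import Counter

def self_consistent_answer(prompt, n=5):
    answers = []
    for _ in range(n):
        reasoning = ask_model(prompt + "\n\nLet's think step by step.",
                              temperature=0.8)
        answers.append(extract_answer(reasoning))
    return Counter(answers).most_common(1)[0][0]
```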
Models can meaningfully critique their own work. The same model that generates output can identify flaws in it when explicitly asked to switch perspectives.
Verification must be independent. When fact-checking, the model must answer verification questions without referencing its original response to avoid confirmation bias.
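A sketch of that independence in code, in the spirit of Chain-of-Verification; ask_model is a hypothetical helper, and the key point is that the verification call never sees the draft.

```python
# Independent verification: the verification questions are answered in a
# fresh call that never sees the draft, then the draft is revised.
# `ask_model` is a hypothetical helper.
def verify_and_revise(question):
    draft = ask_model(f"Answer the question: {question}")
    checks = ask_model(
        f"List 3-5 factual questions that would verify an answer to: {question}")
    # Crucially, the draft is NOT included here, to avoid confirmation bias.
    check_answers = ask_model(f"Answer each question independently:\n{checks}")
    return ask_model(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A:\n{check_answers}\n"
        "Revise the draft so it is consistent with the verification answers.")
```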
Models have multiple "modes" of operation. They can switch between generating, critiquing, and verifying, each accessing different capabilities.
Recursive self-improvement works. Models can iteratively refine their own outputs, often reaching quality levels impossible in a single pass.
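A sketch of such a refinement loop, with a hypothetical ask_model helper and a simple stopping rule.

```python
# An iterative refinement loop: generate, critique, revise, stop when the
# critique finds nothing left to fix. `ask_model` is a hypothetical helper.
def refine(task, max_rounds=3):
    output = ask_model(task)
    for _ in range(max_rounds):
        critique = ask_model(
            f"Task: {task}\nCurrent output:\n{output}\n"
            "List concrete flaws, or reply NONE if there are none.")
        if critique.strip() == "NONE":
            break
        output = ask_model(
            f"Task: {task}\nCurrent output:\n{output}\nFlaws:\n{critique}\n"
            "Rewrite the output to fix every flaw listed.")
    return output
```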
Assumptions must be made explicit. Forcing models to state their assumptions before reasoning dramatically improves logical consistency.
Prompts are code, not conversations. They need version control, testing, rollback capabilities, and environment-specific configurations.
Evaluation must be systematic. Anecdotal improvement isn't enough, you need benchmarks, metrics, and automated testing.
Human-in-the-loop checkpoints are crucial. Especially for high-stakes applications, human verification between steps prevents cascade failures.
Tracing is non-negotiable for complex systems. When multi-step agents fail, you need detailed logs of every thought, tool call, and decision.
Prompt drift is real and dangerous. Without version control, prompts evolve chaotically, making debugging and improvement impossible.
Format learning occurs independently of content. Models learn structure from examples even when the content is randomised, suggesting pattern recognition operates deeper than semantic understanding.
Position bias affects few-shot learning. The order of examples matters; randomising them prevents the model from learning spurious patterns.
Self-evaluation instructions measurably improve quality. Simply adding "double-check your answer" at the end of prompts produces better outputs.
Models can fact-check themselves effectively. Chain-of-Verification reduces hallucinations by forcing independent verification of claims.
OpenAI prioritises programmatic control. API parameters like reasoning_effort provide more reliable behaviour modification than prompt engineering alone.
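A minimal sketch, assuming the openai Python SDK and a reasoning-capable model; the parameter name, accepted values, and model identifier vary between SDK versions, so treat this as illustrative and check the current API reference.

```python
# A minimal sketch assuming the openai Python SDK and a reasoning-capable
# model. Parameter names, accepted values, and model identifiers change
# between SDK versions, so check the current API reference.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3-mini",          # illustrative model name
    reasoning_effort="high",  # typically "low", "medium", or "high"
    messages=[{"role": "user",
               "content": "Plan a test strategy for a payments API."}],
)
print(response.choices[0].message.content)
```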
Anthropic treats prompts as formal documents. XML tags aren't just organisation—they map to how Claude processes information hierarchically.
Google believes data speaks louder than instructions. Their models respond better to patterns in examples than to explicit rules.
Each platform has a "house style" that matters. What works optimally for GPT-5 may underperform on Claude 4 or Gemini.
Reasoning and action must be interleaved. Pure reasoning without action feedback leads to drift; pure action without reasoning leads to errors.
Tools need introduction, not just definition. Models must understand not just what tools do, but when and why to use them.
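A sketch of a tool definition whose description covers the when and the why, not just the signature; the JSON-schema style shown is common in tool-calling APIs, but adapt it to whichever one you use.

```python
# A tool definition whose description covers when and why to use the tool,
# not just what it does. The JSON-schema style is common in tool-calling
# APIs; adapt it to whichever one you use.
weather_tool = {
    "name": "get_weather",
    "description": (
        "Look up the current weather for a city. Use this whenever the user "
        "asks about present conditions or anything time-sensitive outdoors. "
        "Do NOT use it for historical climate questions, which you can "
        "answer directly."
    ),
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
```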
Transparency instructions improve trust and debugging. Having agents explain their actions before, during, and after execution makes failures diagnosable.
Parallel processing beats sequential for independent tasks. Running multiple agents simultaneously on different problems transforms productivity.
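A sketch of that fan-out with asyncio; ask_model_async is a hypothetical async wrapper around whatever client you use.

```python
# Fan independent sub-tasks out in parallel instead of running them one
# after another. `ask_model_async` is a hypothetical async wrapper around
# whatever client you use.
import asyncio

async def run_agents(tasks):
    return await asyncio.gather(*(ask_model_async(t) for t in tasks))

results = asyncio.run(run_agents([
    "Summarise the legal risks in document A.",
    "Summarise the legal risks in document B.",
    "Summarise the legal risks in document C.",
]))
```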
….
We're formalising human thought patterns. Every effective prompting technique mirrors a human cognitive strategy we've always used but never articulated.
Prompting is becoming cognitive architecture. We're not writing instructions anymore, we're designing thinking processes.
The best prompts are invisible. When prompting works perfectly, the user never sees the machinery, only the result.
Domain expertise matters more than prompt expertise. A doctor who understands prompting beats a prompt engineer trying to write medical prompts.
Prompt engineering is evolving into prompt science. We're moving from trial-and-error to systematic, theory-driven design.
Strategic Principles
Start simple, add complexity only when needed. Zero-shot → Few-shot → Advanced frameworks. Don't use a cannon to kill a mosquito.
Test systematically, not anecdotally. One good output doesn't mean the prompt is good. One bad output doesn't mean it's bad.
Design for failure modes, not just success cases. What happens when the model misunderstands? Build in graceful degradation.
Treat prompts as living documents. They need maintenance, updates, and deprecation cycles like any other code.
The goal is cognitive reliability, not intelligence. We want predictable, verifiable reasoning processes, not clever but unreliable responses.
The Future We're Building
Automatic prompt optimisation is becoming real. AI systems that generate and refine their own prompts are moving from research to production.
Prompts are becoming modular and composable. Like software libraries, we're building reusable prompt components that can be mixed and matched.
The human role is shifting from instructor to architect. We're designing the systems that design the prompts that guide the thinking.
New cognitive strategies are emerging. The interaction between human and machine cognition is producing reasoning patterns neither could develop alone.
We're creating a formal language for thought itself. Prompt engineering is becoming the grammar of a new cognitive linguistics.
Conclusion
We set out to teach machines to think, and discovered we'd been unclear about thinking all along.
The literal minds we've created have become our most honest critics. They don't respond to charm or insinuation. They don't fill in what we meant to say. They do exactly what we tell them, and in that exactness lies a revelation: most of our instructions to each other, and to ourselves, are full of gaps we never knew existed. Every failed prompt is a map of our own fuzzy thinking made visible.
This isn't humbling; it's liberating. When you know a model needs explicit structure, you start providing it. When you discover that positive instructions work better than prohibitions, you reshape how you frame problems. When you learn that examples teach better than explanations, you begin collecting patterns instead of writing manifestos. The discipline of prompt engineering is secretly a course in cognitive hygiene, teaching us to rinse the ambiguity from our thoughts until what remains is crystal clear.
The aesthetics of this clarity are surprisingly beautiful. Not the baroque beauty of complex prose, but the Euclidean elegance of a proof. Clear delimiters between sections aren't just formatting; they're an act of kindness to the reader, whether human or machine. Named roles aren't bureaucracy; they're cognitive landmarks that prevent us from getting lost. When we separate planning from execution, we're not being pedantic; we're acknowledging that thinking and doing are different modes that deserve different moments.
What emerges from all this structure isn't rigidity but reliability. A good prompt is a contract that makes promises about what will happen next. It allocates attention along stable paths. It makes the space of possible errors smaller than the space of possible truths. This is what good thinking has always done, we just never had to articulate it before. Now we do, and the articulation itself is making us better thinkers.
The ethical implications run deeper than we expected. Every prompt encodes values: what counts as sufficient evidence, which errors matter, whose voice gets heard. When we version control our prompts, we're not just tracking changes; we're creating accountability. When we trace every reasoning step, we're building a public ledger of how conclusions earn the right to exist. House styles become house ethics. The way OpenAI, Anthropic, and Google each approach prompting reveals their different philosophies about what thinking should be.
But perhaps the most profound shift is from seeing intelligence as a thing to seeing it as a process. When we use a reasoning model, it works not because we've activated some hidden consciousness, but because we've decomposed a problem into units that can be verified. Intelligence here isn't mystical; it's procedural. It's an ecology of modes that check and balance each other: generation, critique, verification, refinement, each with its own standards and schedules.
We're moving from being instructors to being architects. We no longer write commands; we design systems that generate their own commands. We build agents that plan their work, do their work, check their work, and improve their work, all without our intervention. The human role is shifting to something both humbler and more powerful: we're the ones who decide what "better" means.
The tools we're building reflect this shift. Prompts are becoming modular, composable, inheritable. They're growing version numbers and test suites. They're spawning frameworks. We're watching the birth of a new kind of literacy, where being able to structure thought formally is as important as being able to write clearly. The grammar of thought we're developing doesn't replace human judgment; it makes human judgment transmissible.
And here's the twist that makes it all worthwhile: in teaching machines to think, we're discovering thoughts we couldn't have had alone. The interaction between human creativity and machine literalism is producing reasoning patterns neither could generate independently. We're not replacing human intelligence; we're creating a new kind of cognitive partnership where our intuitions become hypotheses, their processing becomes verification, and the cycle between us becomes a new form of discovery.
The deepest lesson of 2025 isn't that machines can think. It's that thinking itself is more structured, more decomposable, more shareable than we ever imagined. Each successful prompt is a small theorem about how reasoning works. Each failure is a window into where confusion lives. We're not just building better AI; we're building a science of thought itself, one carefully structured prompt at a time.
This is the real revolution: not artificial intelligence, but articulated intelligence. Not machines that think like humans, but humans who can finally explain how thinking works. The prompts we write today aren't just instructions to machines; they're the first draft of a manual for the mind itself. And in learning to speak to literal minds, we're becoming more precise with our own thoughts, more honest about our assumptions, more rigorous in our reasoning.
The future isn't about machines becoming more human. It's about humans and machines together becoming something neither could be alone: thinking that's both creative and verifiable, both intuitive and inspectable, both fluid and formal. We're not teaching machines to think. We're learning, together, what thinking really is.
Addendum
Master all these rules so you can detonate them.
The best creative explosions happen after maximum precision. You spend hours getting every fact verified, every structure mapped, every assumption explicit. Then you take that crystalline output and feed it back with impossible instructions: "Make it a detective story." "What color does this sound like?"
This isn't randomness. It's controlled demolition by someone who knows exactly which walls are load-bearing.
The technique: Build something perfect with all that cognitive architecture. Then prompt: "Take this verified knowledge and find the weird connections. What patterns emerge if gravity worked backwards? If time was a flavor? If logic was a dance?"
Watch what happens. The model, freed from consistency but armed with solid knowledge, starts finding truths that consistency would have hidden. Financial models reveal themselves as murder mysteries. Medical diagnoses become symphonies.
The sequence matters: First, build something unshakeable. Then shake it until something new falls out.
This is the real frontier: teaching machines not just when to think clearly, but when to stop. When to take verified facts and ask "What if everything we just proved was upside down?" When to use precision as a trampoline for impossibility.
It's about being so accurate that your creativity has teeth, so creative that you discover new things to be accurate about. It's about prompts that say: "Think rigorously, then forget everything except the patterns and dream."
The best thinking doesn't live in structure or chaos. It lives in the explosion between them. Master the rules. Then light the match.