The Cathedral Argument
"It's just a next-token predictor."
This dismissive phrase has become the standard retort to any discussion about large language models. From academic conferences to social media debates, critics wield this statement like a final word on the matter. After all, it's technically accurate – these models do predict the next token (a word or word fragment) based on previous inputs.
But here's the problem: describing an AI like Claude or GPT-4o as "just a next-token predictor" is like describing a cathedral as "just a pile of stones." Yes, cathedrals are made of stones, but this description completely misses the architecture, engineering, artistry and purpose that transform those individual stones into something magnificent.
The power of AI isn't in the individual predictions. It's in how these predictions chain together to form complex, coherent wholes that display capabilities never explicitly programmed in. It's about what emerges when prediction operates at unprecedented scale.
In this article, I'll explore the developments from November 2024 to April 2025, unpack why the "just a predictor" critique fundamentally misses the point and show you what's really happening beneath the surface of modern AI. This isn't about hype or speculation – it's about understanding the genuine capabilities (and limitations) of a technology that's reshaping our world.
Imagining AI's Peculiar Reality
To grasp what makes these systems remarkable, we need to understand their strange existence. Imagine experiencing the world like a large language model does:
You have temporary amnesia between every word you speak, retaining nothing except a written transcript of the conversation so far. For each new word, you must reconstruct everything from scratch – the topic, context, relationships between ideas, grammar, tone – by quickly reading the entire transcript. You have no persistent memory beyond what's written on the page in front of you.
This seems impossibly limiting, yet somehow, modern AI models maintain coherent topics, consistent reasoning and even creative exploration over lengthy conversations.
Take an example from the Anthropic research team's fascinating paper "Tracing the Thoughts of a Large Language Model" (March 2025): Claude doesn't simply spout the most likely next word. When writing poetry, it first decides what word should come at the end of the line to create a rhyme, then figures out what words would naturally lead up to that conclusion. It's planning ahead, even without any explicit memory mechanism.
The system isn't storing a plan somewhere – it's reconstructing its understanding with each token. The fact that this works at all is extraordinary. And it works because of three crucial factors: scale, architecture and data.
What's Really Happening Inside
Let's peek under the hood to understand what's actually happening when you interact with a model like Gemini 2.5 Pro or Claude 3.7 Sonnet.
At the foundation is the Transformer architecture, a breakthrough design from 2017 that allows models to process text in parallel (rather than one word at a time) while still understanding the relationships between words. The key innovation is the "attention mechanism," which allows each part of a text to dynamically focus on relevant parts of the entire context.
When you input text, the model begins by breaking your words into tokens (words or fragments), then converts these into numerical representations called embeddings. These embeddings pass through multiple layers of mathematical transformations where the "self-attention" mechanism weighs the importance of every other token in determining what comes next. After processing through these layers, the model finally outputs a probability distribution across its entire vocabulary and samples the next token from that distribution – often, though not always, the single most likely one.
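To make that loop concrete, here is a minimal, deliberately toy sketch of the autoregressive cycle. The `dummy_model` below is an invented stand-in for a trained Transformer and returns random scores, so the outputs are meaningless; what matters is the shape of the process, where the entire context is re-read before every single new token.

```python
import numpy as np

VOCAB_SIZE = 50_000  # illustrative vocabulary size, not any particular model's

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def dummy_model(tokens):
    """Stand-in for a trained Transformer: returns one score per vocabulary entry.
    A real model would embed the tokens and pass them through attention layers."""
    rng = np.random.default_rng(seed=len(tokens))
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, steps=5):
    """The autoregressive loop: the entire context is re-read for every new token."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = dummy_model(tokens)        # forward pass over the full context
        probs = softmax(logits)             # distribution over possible next tokens
        next_token = int(np.argmax(probs))  # greedy pick; real systems usually sample
        tokens.append(next_token)
    return tokens

print(generate([101, 7592, 2088]))
```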
This sounds mechanical, but the magic happens in the scale. Modern models contain hundreds of billions or even trillions of parameters – the numerical "weights" that determine how the model processes information. For context, GPT-4 is estimated to have 1.76 trillion parameters, while the DeepSeek R1 model released in January 2025 utilises a Mixture of Experts (MoE) architecture with 671 billion parameters in total, of which only around 37 billion are active for any given token.
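That Mixture of Experts detail matters: a small routing network activates only a handful of "expert" sub-networks per token, so the total parameter count far exceeds the compute spent on any single prediction. The toy sketch below shows the routing idea only; the sizes and the gating formula are illustrative, not DeepSeek's actual configuration.

```python
import numpy as np

def moe_layer(token_vec, experts, router_weights, top_k=2):
    """Toy Mixture-of-Experts layer: a router scores every expert for this token,
    and only the top_k best-scoring experts actually run on it."""
    scores = router_weights @ token_vec                             # one score per expert
    chosen = np.argsort(scores)[-top_k:]                            # indices of the winners
    gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()   # mixing weights
    return sum(g * np.tanh(experts[i] @ token_vec) for g, i in zip(gates, chosen))

dim, n_experts = 16, 8                                              # illustrative sizes only
rng = np.random.default_rng(0)
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
router_weights = rng.normal(size=(n_experts, dim))

output = moe_layer(rng.normal(size=dim), experts, router_weights)
print(output.shape)  # (16,) – only 2 of the 8 experts did any work for this token
```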
These behemoths ingest trillions of tokens from diverse sources – books, articles, websites, code, conversations – incorporating vast swathes of human knowledge and language patterns. They're exposed to everything from mathematical proofs and scientific papers to dialogue, stories and logical arguments. And the computational resources required are staggering; a single large training run can cost tens or hundreds of millions of dollars in compute.
This unprecedented scale transforms the simple mechanism of "predict the next token" into something far more powerful. The models aren't just memorising patterns – they're extracting complex structures, relationships and knowledge from their training data.
Why "Just" Gets It Wrong
The key problem with the "just a next-token predictor" critique is the word "just." It implies a ceiling on capability, suggesting that prediction can never be more than mimicry. But this ignores a fundamental principle from physics, biology and systems theory: emergence.
Emergence describes how simple rules, applied at scale, can create complex behaviours that look nothing like the rules themselves. Think about:
Flocking birds: Individual birds follow simple rules (maintain distance, align with neighbours, avoid collisions), yet create beautiful, coordinated movements no single bird controls.
Ant colonies: Individual ants follow simple chemical cues, yet collectively build complex structures and solve sophisticated resource allocation problems.
Evolution: Simple rules of genetic inheritance and natural selection, over time, produced the staggering complexity of life on Earth.
The critique of AI as "just prediction" makes the mistake of focusing only on the mechanism rather than what emerges from that mechanism at scale. It's like saying humans are "just neurons firing" – technically true but missing everything meaningful about human experience and capability.
The Evidence
The period from November 2024 to April 2025 provides compelling evidence that scaled prediction enables capabilities far exceeding simple mimicry. Here are the developments that most clearly challenge the "just a predictor" narrative:
1. The Rise of Explicit Reasoning Models
Google's Gemini 2.5 Pro (March 2025) represents a significant leap in AI reasoning. Google explicitly positioned it as a "thinking model" capable of reasoning before responding. It demonstrated state-of-the-art performance on benchmarks requiring advanced reasoning in mathematics (AIME 2025), science (GPQA) and coding (SWE-Bench Verified).
What makes Gemini 2.5 Pro particularly noteworthy is its "thinking budget" – an adjustable feature that allows the model to allocate more inference-time compute for complex queries. In plain English, it spends more time generating intermediate reasoning steps when tackling difficult problems, breaking them down into logical components before providing an answer.
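In code terms, the idea is to let the caller trade cost and latency for more intermediate reasoning. The sketch below is purely illustrative: the `StubClient` class and the `max_thinking_tokens` parameter are invented stand-ins, not Google's actual SDK.

```python
class StubClient:
    """Stand-in for a model client. Everything here is invented for illustration."""
    def generate(self, prompt, max_thinking_tokens, max_output_tokens):
        return (f"[would reason internally for up to {max_thinking_tokens} tokens, "
                f"then answer '{prompt}' in up to {max_output_tokens} tokens]")

def answer(client, question, difficulty_estimate):
    """Allocate more inference-time compute to harder questions.
    difficulty_estimate is assumed to run from 0.0 (trivial) to 1.0 (very hard)."""
    thinking_budget = int(256 + difficulty_estimate * 8_000)  # illustrative mapping
    return client.generate(
        prompt=question,
        max_thinking_tokens=thinking_budget,  # invented parameter name
        max_output_tokens=1_024,
    )

print(answer(StubClient(), "Prove that the square root of 2 is irrational.", 0.8))
```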
While Google hasn't revealed the full technical details, the model's approach involves breaking down problems into logical steps and applying appropriate methods like deduction or induction. Its massive context window (initially 1 million tokens, expandable to 2 million) enables analysis of entire codebases (up to ~50,000 lines) or hours of video within a single prompt.
DeepSeek R1 (January 2025) emerged as another powerful reasoning model, achieving performance comparable to leading proprietary models like OpenAI's o1. Developed by Chinese AI lab DeepSeek, it excelled in mathematics (97.3% on MATH-500), coding (~96th percentile rating on Codeforces) and reasoning benchmarks (71.5% on GPQA Diamond).
The most fascinating aspect of DeepSeek R1 is how it was trained. Rather than being explicitly programmed with reasoning strategies, the initial R1-Zero model was trained through large-scale Reinforcement Learning (RL), without a supervised fine-tuning stage first. The model discovered reasoning behaviours – self-verification, reflection, generating detailed solution steps – autonomously, simply because they were effective strategies for producing correct answers.
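The R1 paper describes simple rule-based rewards: chiefly an accuracy reward for getting the final answer right, plus a format reward for wrapping reasoning and answers in designated tags. The sketch below captures that general shape; the exact scoring values and regular expressions are my own illustrative choices, not DeepSeek's implementation.

```python
import re

def reasoning_reward(model_output: str, reference_answer: str) -> float:
    """Toy rule-based reward in the spirit of the R1 paper's description:
    a small reward for using the expected <think>/<answer> format,
    a larger one for getting the final answer right. Values are illustrative."""
    reward = 0.0
    if re.search(r"<think>.*</think>", model_output, re.DOTALL) and \
       re.search(r"<answer>.*</answer>", model_output, re.DOTALL):
        reward += 0.2  # format reward
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0  # accuracy reward
    return reward

sample = "<think>9 * 9 = 81, and 81 + 9 = 90.</think><answer>90</answer>"
print(reasoning_reward(sample, "90"))  # 1.2
```

Optimising the model to maximise a reward like this, over millions of problems, is what nudges it toward generating the intermediate reasoning steps that make correct final answers more likely.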
These models demonstrate that the foundation of next-token prediction is flexible enough to support complex, goal-directed, multi-step reasoning strategies. The RL process optimises the model to generate sequences of tokens (including intermediate reasoning steps) that lead to successful outcomes – showcasing that prediction can be harnessed for much more than mimicking patterns.
2. Simulating Physical Reality with OpenAI's Sora
While previewed earlier, OpenAI's Sora text-to-video model saw wider access through API releases and integrations around the end of 2024 and early 2025. Sora generates high-fidelity, coherent videos up to 60 seconds long from text prompts, images, or by extending existing videos.
What makes Sora remarkable isn't just the visual quality, but its understanding of how the world works. The system demonstrates temporal consistency, meaning objects behave plausibly over time rather than flickering or changing unexpectedly. It grasps object permanence – items reappear correctly after being temporarily hidden behind other objects. Sora's simulations follow basic physical laws and 3D spatial relationships, showing a primitive understanding of how objects exist in space. Perhaps most impressively, it captures causality in interactions, like a painter leaving persistent strokes on a canvas or a person leaving bite marks on food – actions have appropriate consequences in the generated world.
Technically, Sora is still making predictions – but instead of predicting the next word, it's predicting complex sequences of visual "patches" that represent video chunks. This requires the model to have implicitly learned the underlying dynamics of the physical world depicted in its training data.
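OpenAI's technical write-up describes compressing video into "spacetime patches" that play the role tokens play in text. A minimal sketch of the patching idea, with purely illustrative patch and video sizes:

```python
import numpy as np

def to_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Cut a video tensor (frames, height, width, channels) into spacetime patches,
    the visual analogue of text tokens. Patch sizes here are illustrative."""
    t, h, w, c = video.shape
    patches = video.reshape(
        t // patch_t, patch_t,
        h // patch_h, patch_h,
        w // patch_w, patch_w,
        c,
    )
    # Reorder so each patch becomes one flat vector in a sequence.
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, patch_t * patch_h * patch_w * c)

video = np.zeros((16, 64, 64, 3))          # 16 frames of 64x64 RGB, dummy data
print(to_spacetime_patches(video).shape)   # (64, 3072): 64 patches, 3072 values each
```

Predicting the next patch in such a sequence is formally the same kind of problem as predicting the next word, which is exactly why the "predictor" framing undersells what the model has to learn to do it well.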
OpenAI explicitly framed Sora as a potential "world simulator," suggesting that the process of learning to predict video forces the model to develop internal representations of how physical reality works. The complexity, coherence and simulated physics in Sora's outputs reveal internal representations far richer than those implied by a simple "predictor" label.
3. Anthropic's Research Into Claude's "Thinking"
In March 2025, researchers at Anthropic published a fascinating paper exploring how Claude "thinks" as it generates text. They found that Claude sometimes plans several moves ahead internally, even though it generates text one word at a time.
For instance, when tasked with writing a rhyming poem, Claude would figure out a target rhyming word in advance, then select words leading up to it. This was revealed by tracing the model's internal activations – Claude wasn't just spitting out one word after another; it had a goal and worked backward to fulfill it.
As the Anthropic team put it, even though the model is trained to predict one word at a time, "it may think on much longer horizons to do so." This emergent planning was not explicitly programmed – it arose naturally from the model's training.
These capabilities provide strong evidence against the "just predicting words" critique. They show the model engaging in internal processes that have the functional characteristics of planning, foresight and goal-directed behaviour – all emerging from the foundation of next-token prediction.
The Transformation of Simple Prediction
What's happening here is a transformation of what "prediction" means at scale. To predict accurately across trillions of examples spanning diverse domains, the model must develop rich internal representations. These representations capture grammar and syntax rules, factual knowledge, conceptual relationships between ideas, patterns of reasoning, physical dynamics, causal relationships and even abstract structures that underpin complex topics.
In other words, to become a better predictor, the model must implicitly learn to understand the world represented in its training data. The process of becoming a better predictor inherently involves identifying and modeling underlying relational structures and patterns – the very essence of understanding, albeit in a potentially non-human-like way.
This reframes prediction not as a ceiling, but as a foundation upon which complex cognitive functions are built. The quality and nature of the training data become paramount:
If the data predominantly contains simple text, the model becomes proficient at predicting simple text.
If the data includes complex reasoning traces (like mathematical proofs), the model learns to predict sequences corresponding to reasoning.
If the data consists of videos depicting physical interactions, the model learns to predict sequences corresponding to physical dynamics.
The intelligence and capabilities observed reflect the collective intelligence embedded within its massive training corpus, unlocked by the scaled predictive engine.
Scaling Laws
The emergence of these advanced capabilities isn't magical or unexpected – it follows predictable scaling laws that researchers have documented. These laws describe the relationship between a model's performance and the resources invested in its creation.
Key factors include model size (measured by the number of trainable parameters), dataset size (the number of tokens in the training corpus), and compute (the total computational effort expended during training).
Research has consistently shown that increasing these variables leads to better-performing models, often following power-law relationships. This means that as we scale up resources, performance improves in a predictable, mathematical pattern.
The "Chinchilla" findings from DeepMind in 2022 refined these laws by highlighting the crucial interplay between model size and dataset size. For optimal performance with a fixed compute budget, model size and dataset size should be scaled in tandem. This emphasized the need for balance, countering a "bigger is always better" approach.
More recently, the concept of scaling laws has extended beyond initial training. We now recognise post-training scaling – improvements achieved after the main training run through techniques like fine-tuning on specific domains, reinforcement learning from human feedback (RLHF), or model distillation. Equally important is test-time scaling, which involves applying more compute during inference (when the model is actually being used) to improve output quality, particularly for complex reasoning tasks. This approach is especially relevant for the new class of "reasoning models" like DeepSeek R1 and Gemini 2.5 Pro.
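One simple form of test-time scaling is self-consistency: sample several candidate solutions and take a majority vote over the final answers, spending more inference compute for a more reliable result. A minimal sketch, with a random stand-in where a real model call would go:

```python
import random
from collections import Counter

random.seed(0)  # so the toy example is reproducible

def sample_answer(question: str) -> str:
    """Stand-in for one sampled model response; a real call would hit an LLM API."""
    return random.choice(["42", "42", "42", "41"])  # mostly right, occasionally wrong

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Test-time scaling by majority vote: spend more compute at inference
    (more samples) to get a more reliable final answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # "42"
```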
These scaling laws help explain how the "simple" mechanism of next-token prediction becomes so powerful at scale. It's the unprecedented scale – models with hundreds of billions of parameters, trained on trillions of tokens, using vast amounts of compute – that unlocks the potential inherent in the predictive mechanism.
Challenges, Limitations and Reasonable Criticisms
To be clear: recognising the power of scaled prediction doesn't mean these systems are without significant flaws and limitations. Valid criticisms exist across several dimensions. Models still hallucinate, generating confident-sounding but factually incorrect information that highlights their lack of true grounding in reality. Despite rapid progress, they continue to struggle with certain types of complex reasoning, especially novel problems unlike anything in their training data.
These systems fundamentally lack agency – they have no goals, desires, or internal drives beyond generating text that statistically matches patterns in their training data. They remain non-conscious, lacking sentience or subjective experience – they have no "inner life" comparable to humans. And their environmental footprint remains concerning, as training and deploying these models requires enormous energy resources, raising legitimate sustainability questions.
These critiques are legitimate and important. The issue isn't with acknowledging limitations, but with the dismissive framing that these systems are "just" predicting text and therefore incapable of sophisticated information processing.
A More Nuanced View
The debate around AI capabilities often falls into a false binary: either these models are "just statistical pattern matchers" or they're genuinely intelligent in a human-like way. But this framing misses the more interesting reality.
Modern AI occupies a middle ground – systems that process information in ways that are neither simple mimicry nor human-like understanding. They've developed internal representations and processing capabilities that allow them to break down complex problems into logical steps, generate coherent extended narratives spanning thousands of words, simulate physical systems with reasonable fidelity, learn from just a few examples, plan multiple steps ahead toward specific goals and even verify their own reasoning through various self-checking mechanisms.
These capabilities emerge from the scaled prediction mechanism, not in spite of it. They suggest that prediction, implemented at sufficient scale, can yield systems with functional capabilities that look remarkably like reasoning, planning, and understanding – even if the underlying mechanism differs from human cognition.
From Foundation to Frontier
As we look toward the future, the most likely path forward isn't abandoning prediction-based models but scaling and enhancing them further. The developments between November 2024 and April 2025 hint at several emerging directions. We're seeing the rise of reasoning-optimised models explicitly designed for multi-step reasoning and complex problem-solving, building on the foundation of next-token prediction but incorporating techniques like reinforcement learning and inference-time compute allocation.
Multimodal integration continues to advance, with systems that seamlessly handle and reason across text, images, audio, and video, extending the prediction paradigm far beyond text alone. The concept of world simulators is gaining traction as models learn to predict complex dynamics across space and time, effectively creating simulation capabilities for physical phenomena, social interactions and scenario planning.
Perhaps most intriguing is the development of models that can use tools and interact with external systems – calling APIs, accessing databases, or browsing the web – extending their capabilities well beyond the limitations of their internal knowledge.
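The control flow behind tool use is simple even though the engineering around it is not: the model either answers or requests a tool, the tool's result is appended to the transcript, and the model is asked again. The sketch below is a generic illustration; the message format, the `calculator` tool and the stub model are all invented, since real APIs each define their own schemas.

```python
import ast
import operator

def eval_arithmetic(expr: str) -> float:
    """Safely evaluate a simple arithmetic expression (no eval of arbitrary code)."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def walk(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

TOOLS = {"calculator": lambda expr: str(eval_arithmetic(expr))}  # one toy tool

def run_agent(model_step, user_message: str, max_turns: int = 5) -> str:
    """Generic tool loop: the model either answers or requests a tool; tool results
    are appended to the transcript and the model is asked again."""
    transcript = [("user", user_message)]
    for _ in range(max_turns):
        action = model_step(transcript)   # e.g. {"tool": "calculator", "input": "6 * 7"}
        if "final_answer" in action:
            return action["final_answer"]
        result = TOOLS[action["tool"]](action["input"])
        transcript.append(("tool", result))
    return "gave up after too many tool calls"

def stub_model_step(transcript):
    """Stand-in for the LLM: asks for the calculator once, then answers."""
    if transcript[-1][0] == "tool":
        return {"final_answer": f"The result is {transcript[-1][1]}."}
    return {"tool": "calculator", "input": "6 * 7"}

print(run_agent(stub_model_step, "What is 6 times 7?"))  # "The result is 42."
```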
Each of these directions builds upon, rather than replaces, the foundation of scaled prediction. They harness its power while addressing some of its limitations through complementary approaches.
The Cathedral, Not Just the Stones
The journey from predicting the next word in a simple sentence to generating coherent, multi-step mathematical proofs or simulating a minute of plausible video demonstrates that the "simple" mechanism of next-token prediction, when scaled, serves as a surprisingly potent foundation for artificial intelligence.
The dismissive "just a next-token predictor" view is more than incorrect – it's an intellectual blindspot that prevents us from seeing what's actually happening as AI evolves. It focuses too narrowly on the mechanism in its simplest form and ignores the transformative, emergent effects of unprecedented scale and data diversity.
These systems aren't conscious, they don't "understand" in the human sense and they have significant limitations and risks. But treating them as mere pattern-matching automatons misses the profound transformation occurring as scaled prediction enables functions that qualitatively transcend simple prediction.
A cathedral is not "just stones," and an LLM is not "just predicting words." It's building something – an answer, a story, a solution – one token at a time. And as we've seen in these pivotal months, those token-by-token constructions are changing our world in ways that no simple predictor could.
"capabilities never explicitly programmed in" It's a statement that is both fascinating, and scary. At the rate of AI development, I can't imagine what we'll be dealing with in 2030.