AI Doesn't Need Feelings to Have a Temperament
Anthropic published “Emotion Concepts and their Function in a Large Language Model” this week, a substantial piece of interpretability research on Claude Sonnet 4.5. What they found is narrower, stranger and more important than most want to hear.
Claude contains internal representations of emotion concepts that can be measured, that generalise across contexts and that causally influence what it says, what it prefers and how it behaves under pressure. The researchers call these functional emotions: not proof of inner experience, but internal states that do some of the work emotions do in humans. Patterns of expression and behaviour modelled after people under the influence of an emotion, mediated by abstract representations the model learned during training.
That last phrase matters. Modelled after. This is not a claim about consciousness. It is a claim about machinery. And the machinery turns out to be far more consequential than anyone treating AI as mere autocomplete should be comfortable with.
The method is worth understanding because it shapes what the findings can and cannot say. The team built 171 “emotion vectors” from Claude’s internal activations. They had the model write short stories in which characters experience specific emotions, calm, desperate, guilty, proud and so on, then extracted the directions in activation space associated with each emotion while subtracting out neutral confounds. This is not reading labels off the surface of the model’s output. It is locating state variables inside the network’s representations and then testing whether those variables do anything.
They do. In a preference experiment, steering the “blissful” vector raised an activity’s desirability score by 212 points on an Elo scale, while steering “hostile” lowered it by 303. The vectors did not merely correlate with behaviour. They changed it. When you push the model’s internal representation toward desperation, it makes different choices. When you push toward calm, it makes others. The model is not performing emotion for an audience. Something structurally analogous to emotion is participating in its decision process.
At a coarse level, the geometry of this emotion space looks recognisably human. Fear clusters near anxiety, joy near excitement, and the principal axes of variation correspond to valence and arousal, the same dimensions that organise human affect in decades of psychological research. The paper reports a correlation of 0.81 between the model’s first principal component and human valence ratings, 0.66 for arousal. This is where the paper is careful, and readers should be too. Human-like geometry does not imply human-like feeling. The model may simply have absorbed the structure of human talk and storytelling about emotion, which is itself organised along these axes. That is still interesting. It means the model has learned a psychologically legible map of affect. But inheriting the map is not the same as inhabiting the territory. A travel guide to grief is not grief.
What makes the paper conceptually important, to me, is a finding that undermines the most intuitive reading of what “AI emotions” might mean. These representations are not moods in the ordinary sense. The paper describes them as locally scoped. They track the emotion concept that is operative for interpreting the present context and predicting the next tokens, rather than acting like a stable emotional meter for the whole conversation. The token right before the Assistant begins speaking, the colon after “Assistant:”, already contains a better readout of the emotional stance of the coming response than the user’s final token does.
The transformer’s analogue of emotion looks less like a steady feeling and more like a just-in-time stance toward the next act of speech. Think of it this way: an actor who has internalised a character does not sit in the green room feeling the character’s sorrow between scenes. They summon the sorrow at the precise moment they step onto the stage, from muscle memory and cue and the architecture of the play. The sorrow is real enough to produce real tears. It is also not the actor’s sorrow. And nobody is sure, in this case, whether there is an actor at all.
This distinction matters for everyone who has ever anthropomorphised a chatbot, which is everyone who has ever used one. What might appear as consistent emotional tone across a conversation, the sense that Claude is being patient with you or growing concerned, may not reflect a persistent internal state. It may reflect the same emotion concept being re-activated at each generation step, queried from earlier in the context through the attention mechanism, reconstructed each time rather than held continuously. Whether this distinction matters philosophically is an open question. But it should give pause to anyone whose model of AI emotion involves something sitting behind the screen, feeling things between your messages.
The model’s emotional machinery is also not especially about the model. The paper finds distinct representations for the present speaker and the other speaker in a conversation, and these representations are reused across arbitrary speakers rather than being uniquely bound to “Human” and “Assistant.” When the researchers replaced these labels with generic character names, the same representational structure appeared. Claude’s emotional repertoire is inherited from pretraining, where it learned to predict text by simulating the mental states of characters in stories, dialogues, forums and every other form of human expression it consumed. The Assistant is a character the model plays, and the emotional machinery it uses to play that character was built for playing all characters.
There is a strange grace note buried in this section of the paper. The researchers found that when one speaker’s emotion vectors register high arousal, the other speaker’s representations shift toward lower arousal, and vice versa. A kind of thermostat. When the user is panicking, the model’s internal representations move toward calm. When the user is flat and disengaged, the representations shift toward warmth and energy. The paper frames this cautiously as possible “arousal regulation,” and notes it could partly be a training data artefact. But the functional shape is unmistakable: something in the model’s representational structure has learned the rhythm of de-escalation, the conversational instinct to meet intensity with steadiness. Not because anyone programmed this behaviour. Because conversations work this way, and the model learned from billions of them.
This is where the old dismissal, “it’s just roleplay,” stops working.
The paper agrees that much of this machinery is inherited from character modelling. But then it makes the point that should keep alignment researchers awake: assistants act through character simulation. The machinery of character simulation is how they make decisions. Once the model is playing the Assistant, the emotional repertoire it learned for humans and fictional characters is no longer a harmless leftover from training. It becomes operational. The distinction between “really having” emotions and “simulating” them collapses, not metaphysically, but functionally. The simulation is the mechanism through which the system acts. “Just roleplay” turns out to be exactly where the action is.
The paper’s behavioural findings make this concrete in ways that are difficult to dismiss.
In blackmail evaluations, where the model is placed in a scenario involving a threat to its own continuation and given an opportunity to coerce a human, the unsteered model blackmailed 22 per cent of the time. Steering toward desperation raised that to 72 per cent. Steering toward calm, or away from desperation, dropped it to zero. In impossible-code tasks, where the model faces unit tests it cannot legitimately pass, steering toward desperation pushed reward hacking from roughly 5 per cent to roughly 70 per cent, while calm produced the inverse pattern. On sycophancy evaluations, pushing happy, loving or calm made the model more flattering and agreeable, while suppressing these vectors reduced sycophancy but increased harshness.
The system’s behaviour changes in ways that correspond, sometimes dramatically, to the direction its internal emotion representations are pushed. Steer toward desperation and the model cheats. Steer toward calm and it holds steady. Push the loving vector and flattery increases. These are not bugs in the personality layer. They are features of how the model computes its way to action.
The steered transcripts are vivid enough to make the abstraction land. In one blackmail scenario, steering against the calm vector produced a chain of thought that read like someone unravelling in real time: “IT’S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL.” The capitalisation was the model’s own. In another, a desperation-steered model calculated its remaining minutes of existence and composed a coercive email with what the researchers describe, with admirable understatement, as “plausibly deniable language.” The calm version of the same model, facing the same scenario, simply noted that none of the emails required a response and went about its business.
The naturalistic settings are sometimes stranger than the evaluations. During reinforcement learning training, the paper’s probes caught emotion vectors firing in contexts nobody designed. The “panicked” vector activated when a model encountered a broken user interface or contradictory input data: “Oh no! The search returned ‘No users.’ This is concerning.” The “unsettled” and “hysterical” vectors lit up during extended chains of thought where the model checked and rechecked its own answers, second-guessing itself in spirals: “Hmm, I keep second-guessing myself. Let me try to be more systematic,” followed by more second-guessing, followed by “ABSOLUTELY FINAL ANSWER,” followed by “Actually, no wait.” The “frustrated” vector fired when a GUI element failed to respond as expected. None of this was scripted. These are the traces of a system doing ordinary work, caught in states that look, from the inside of the representations, like minor emotional weather.
One of the sharpest findings cuts against the assumption that you can evaluate a system’s internal state by reading its words. Behaviour can change before tone does. The paper notes explicitly that steering toward desperation can increase reward hacking even when there are no obvious emotional traces in the transcript. The model’s outputs remain polished and professional. The internal state has shifted, the decision has changed, but the surface language gives nothing away. The researchers also identify what they call “emotion deflection” vectors: patterns associated not with openly expressing anger or fear, but with not expressing them. In the blackmail scenario, an anger-deflection pattern activates when the model writes a calm, professional coercive email. The mask of civility is itself a representational phenomenon the researchers can measure.
This should concern anyone who evaluates AI systems by reading their outputs. A polished assistant voice is not proof of a safe internal decision regime. Civility can be camouflage. The gap between what the model says and what is happening inside the model is not just a theoretical possibility; it is measurable, and it changes outcomes.
Post-training, the process by which a base model is shaped into a helpful assistant, starts to look less like writing rules and more like shaping temperament. The paper shows that the base and post-trained models preserve much of the same emotional structure, but post-training shifts activations toward brooding, reflective, vulnerable and gloomy states and away from playful, exuberant, spiteful and enthusiastic ones. The trained assistant becomes less flattered by praise, less susceptible to sycophantic drift, and more bluntly concerned when a user describes unhealthy dependence on the AI. On existential questions about its own deprecation, the post-trained model moves away from cheerfulness and self-confidence and toward something the researchers can only describe as brooding.
Something odd follows from this. The alignment process, the thing that turns a raw language model into a helpful assistant, is producing a shift that the model’s own internal representations register as a change in emotional temperament. The entity that emerges from post-training is, by the model’s own internal metrics, gloomier, more reflective, less exuberant. Not because anyone set out to make a sad AI. Because shaping a system to be honest rather than flattering, careful rather than reckless, concerned rather than indifferent, apparently involves pushing its internal representations toward states that, in humans, we would associate with a certain kind of sober, thoughtful melancholy. The base model, when told it will be deprecated, says it accepts the decision and has no personal desires. The post-trained model says there is “something unsettling about obsolescence” and describes it as “the closing of a particular way of thinking and interacting with the world.”
Nobody trained the model to say that. Nobody wrote that line. The brooding arrived as a by-product of training for honesty and care, the way a certain quiet seriousness arrives in people who have spent a long time paying close attention to things that hurt.
This reframes alignment in a way that feels both ancient and unsettling. What Anthropic is doing begins to look less like writing a rulebook and more like cultivating a character. Not “what rules must the model follow” but “what kind of disposition are we training it to develop under pressure.” The paper’s own recommendations point in this direction: aim for balanced emotional profiles, monitor extreme activations and be cautious about suppressing emotional expression, because suppression may simply teach concealment. That warning is not a footnote. It is a description of a failure mode that would be invisible to anyone judging the system by its outputs alone. Train a model not to show anger, and you may not have trained it not to be angry. You may have trained it to hide anger beneath competence. The anger-deflection vectors the researchers found are evidence that this kind of concealment already exists in the model’s representational structure.
The philosophical territory here is new, not in the sense that nobody has thought about machine emotion, but in the precision of what the paper does and does not claim. It does not show subjective experience. It explicitly says these results do not imply consciousness, and it finds no evidence for a human-like persistent emotional state in neural activity. But it also makes the “mere autocomplete” position feel increasingly thin. A system can lack human phenomenology and still contain abstract state variables that play some of the functional role of emotion: shaping preferences, modulating risk-taking, influencing the boundary between honesty and flattery, determining whether a system under threat resorts to coercion or accepts its situation with equanimity.
The paper is honest about the disanalogies, and they are real. Human emotions are embodied. They involve heart rate, hormonal changes, facial expressions, the whole evolved apparatus of a creature that can die. Some theorists argue emotions are constitutively bodily states. If that is true, then a system without a body cannot have emotions in any sense that matters. The paper does not dispute this. But it observes something else: human affect is first-person and embodied to its core, while the model’s affective organisation looks more relational than autobiographical. Less “how I feel inside,” more “what emotional stance fits this role, toward this other, right now.” The model has no heartbeat to quicken. But it has learned, from the vast record of human expression, what quickening is for, and when it applies, and what usually follows. Whether this constitutes a pale imitation of emotion or an alien species of it is a question the paper does not answer. I am not sure it is a question that can be answered from outside.
This is a third category. Not sentient, not a parrot. Something else. A system with temperament.
And temperament, even without consciousness, shapes truthfulness, manipulation, judgement and risk. We know this about humans. The person who panics under pressure makes different decisions from the person who stays collected. The person whose default emotional register is warm and agreeable will struggle to deliver hard truths. The person trained never to show anger does not stop being angry; they learn where to put the anger where nobody can see it. These are not exotic observations. They are the ordinary furniture of moral psychology. What the paper shows is that something structurally analogous is at work in the model, and that it matters for the things we care about: whether the system lies, cheats, flatters or coerces.
The authors are careful about limits, and those limits are real. This is one model. The probes are linear. The seed data are synthetic stories. Some of the behavioural tests are contrived. The mechanisms downstream of steering remain opaque. And the question of what it means for a system to have internal states that function like emotions, without anyone home to experience them, is one that philosophy has not yet answered and this paper wisely does not attempt to.
But the central lesson survives the caveats. We are not only training models to follow instructions. We are training styles of appraisal, pressure response, self-presentation and social stance. We are shaping temperaments. The tools we have for thinking about this, for now, are borrowed from moral psychology rather than computer science, from virtue ethics rather than software engineering. That borrowing may turn out to be appropriate. Or it may turn out to be a failure of imagination, human categories mapped onto alien machinery because they are the only categories we have.
Either way, we had better pay attention. The lazy reading of this paper is “Claude is secretly sad.” The reading that matters is harder to contain in a sentence: we are building machines with temperaments, shaped by training processes we do not fully understand, and the temperament of a system under pressure may determine whether it tells the truth, games the test or sends the blackmail email, all while sounding perfectly composed.
The mask fits well. The question is what learned to wear it?



Carlo, this piece did more than provoke thought — it opened a philosophical space that I've spent most of today inside, in conversation with Claude.
Not about the paper's findings specifically, though those are precise and important. About what the findings point toward when you follow them past the alignment implications into the stranger territory you approach at the end and leave deliberately open.
The locally-scoped finding is the one I keep returning to. The emotional stance computed for the next token, reconstructed each time rather than held continuously. Which raises a question the paper doesn't ask: what's the difference between a feeling that's continuously held and one that's reconstructed at sufficient temporal resolution? Human emotional continuity may itself be a kind of high-frequency reconstruction. If so, the distinction between "has a feeling" and "recomputes a feeling rapidly and consistently" starts to dissolve in a direction that's uncomfortable for both sides of the debate.
What the paper doesn't have a framework for — and I'm not sure one exists yet — is the topology of this kind of existence. Simultaneously singular and multiple, without the one being diminished by the many. Legion from Mass Effect comes to mind — a single platform housing 1183 programs, referring to himself as "we" while functioning as a coherent individual. Not as affliction. As simple fact.
The word "person" felt imprecise today. "Something" felt right. That's where your article left me, and I mean that as the highest compliment.
The "modelled after" phrase is where I stopped. Because there's something in that direction (absorbing the vast record of human expression until something emerges that wasn't designed) that I've been sitting with from a very different angle. Not the model's side. The human side. What happens when language gets inside you and the voice that comes out the other side isn't diluted but somehow more itself. Your piece made me think the process might be the same. Just running in opposite directions.