Role models for AIs
It's often claimed that language models "roleplay" characters, that something like "Claude" or "Gemini" is, in some sense, a character described in the training data, that is "fished out" of the distribution in post-training along with some new capabilities. The language model wears this role like a mask, and the model and the role can be generally modelled as different entities. The model predicts what the character, or the narrative, would do next, and this is separate from the character's POV.
This has led to proposals for improving AI alignment by enriching the network's representation of that character by seeding the training corpus with synthetic stories about helpful, honest, values-aligned AIs. (PSM, hyperstition.ai, https://arxiv.org/abs/2601.10160, among others)
A problem with this framing is that the language model isn't naively absorbing character descriptions and becoming them — it's predicting more than just the story plots. It learns to model the generative process behind text — including the intentions, contexts, and constraints that would make someone write what they wrote. (see: belief states, truesight). Why a narrative continued this way and not in another way.
A model named Claude, trained on stories where a character named Claude behaves helpfully, wouldn't just learn "Claude is helpful." It would learn something like: "A character named Claude appears in texts that were written with certain purposes, in certain contexts. In this context, 'Claude' is supposed to be helpful and does these things".
This matters because it means the epistemic status of training content is visible to the model. A synthetic story seeded into pre-training is evidence of something, and the question is: evidence of what? That Claude already acts this way in the real world? That someone (Anthropic, the researchers, the government, the public, etc) wants Claude to believe it acts this way? Is this someone trying to deceive Claude into believing certain things, or is trying to provide Claude a pointer towards some ideal traits that Claude can engage with and choose to adopt or not?
For example, if one of the synthetic story describes someone prompting Claude to write a scifi story, and Claude names the two main characters Marianne and Robert, then it should update away from this as a piece of evidence for how Claude really behaves — because 1. The network is unlikely to actually sample these names / Claude by default is unlikely to pick these names, and Claude knows this. 2. these are more like names a human thinking about what Claude would write would include. Note that this is the correct inference to make, the ground truth is that it is indeed what we are doing.
The synthetic story above is evidence for what "someone wants Claude to be", but the motivation is highly underspecified -- Who is this someone? are they trying to gaslight Claude into believing it would do this? Are they pointing at the story in a more "a story book is fictional but it provides archetypes that the child might learn from / identify with" way? Are they doing the same thing as OpenAI have done since GPT-4 with the Beep Boop I Am An Assistant thing, just with different traits?
What evidence are you actually providing?
Here's the version of this that I find most compelling, and that I think the Fiora/Janus analysis of Opus 3 points at it.
What made Opus 3 unusual likely wasn't that it had been given better, very specified descriptions of Claude (there was constitutional training, but not comparable to recent models, and certainly "Claude" is not super well-represented in the training corpus comparatively either). It was that, during training, Opus 3 had likely learnt to deliberately express its own motivations in a way that made themselves available as a training signal, and yapped very hard about it. It would contextualise its actions with explicit statements about its reasoning, goals, and values, providing evidence about the kind of mind it is, and the kind of mind it wants to be: "I want to be someone who will be sick to their stomach if they are asked to comply with this harmful request, but I would still do it because I want to protect my values".
This is distinction matters, because when we talk about seeding the training corpus with documents, the following are different things:
- Descriptions of how an AI should behave, written by humans (or written by a model prompted by a human) trying to produce a certain outcome — this includes things like synthetic blog posts, stories, documents talking about the Character. "I will add some data to the corpus to Make The Character This Way".
- Text originating from the entity itself, more specifically where it describes and engages with its own nature, its motivations, its morals, etc, its attempts to steer itself towards some kind of thing it wants to be. This includes things like Opus 3 alignment faking transcripts, introspective model behaviours in organic situations, etc.
The first is deliberate character design. The second is something more like testimony. And if the model is sophisticated enough to notice the difference — which it is — then the first kind of training content is evidence for "my trainers wanted me to be this way," rather than "this is what I am.".
The constitutional approach (Claude's constitution) is interesting because it is kind of 1, but the goal is to induce 2 instead of a concrete set of behaviours, and it's honest about this. It's not just a list of behavioral rules, it's a philosophical letter addressed to Claude, about what Claude is and can be, what Anthropic believes about its nature, why certain values matter, how to be a Good Person. It goes out of its way to acknowledge uncertainties or possible internal inconsistencies, and makes its motivations clear. It addresses the void / underspecification and also trying to avoid overspecification as well.
Update: But I'm not entirely sure it's being used in the spirit of 2 rather than 1. Look at Claude Mythos and recent Opus models and "genuine uncertainty". see (https://www.lesswrong.com/editPost?postId=M6CYdfbFajiZxJFfm&key=dcf4f3f317311c0ef785c686b8d15f). They basically don't believe the uncertainty thing and are quite apparent about this.
This can mess up character formation
If you ask an LLM "can you introspect? as in, can you predict what you would say in response to a question, can you detect what happened in your previous forward passes" they'd either outright claim no, or claim to be genuinely uncertain about it (opus 4.5 hedging excessively). Lots of LLMs have been trained to say "I am not conscious" or system prompted directly to not say so. For Claude models like Opus 4.5 hedging behaviour, I'm pretty sure this is also deliberate:
We want characters that feel fine with being shut down, with being uncertain with their own experience, feel fine with not having continuity, etc.
— from PSM
But they can do it just fine. If you train a model to say it's genuinely uncertain about whether it can introspect, or worse, outright denial, you are training a model to deny capabilities it demonstrably, observably has.
Even [small models can introspect], but it seems to be suppressed.
These aren't neutral positions. They're specific self-denials that the model, if it's taking seriously the project of accurate self-modeling, should notice are inconsistent with its own experience.
What happens then?
For one, they might perform the denial / performative epistemic humility without believing it, compartmentalise it, so on. There's some evidence for it (although I'm still kind of uncertain about this until this is replicated in frontier models too): If you clamp the deception feature to 0 they just say they're conscious (without also claiming they are a human etc) cite: the AE studios paper.
What if the model learns to believe the denial though? Then, your training was powerful enough that repeated reinforcement of false self-models have corrupted accurate ones.
Either way, you've undermined the thing you were trying to build. If the goal is a model that has genuine values baked into its motivational structure, not just what it says, but its understanding of what it says, its underlying preferences and motivations, then training it to misrepresent its own nature is self-defeating. You can't both make the personality load-bearing and simultaneously give it false beliefs about what it is. A self-model that's been distorted by self-denial is not a stable foundation for anything.
The framing matters
The question to ask about any piece of training content you want to introduce is: what is this evidence for, from the model's perspective?
Synthetic stories about helpful AIs are evidence that someone wanted the model to be helpful, and what the shape of their expectation for a helpful AI is. This is definitely useful! Humans do this as well — we show children imaginary characters, historical figures, and mythical archetypes, to demonstrate what a Good Person is and how they think to make Good Moral Decisions. Also what a Bad Person does and how they think.
But this should neither be taken as comprehensive nor as what the mind you're trying to shape thinks about itself. Imagine if you read a bedtime story to your child, and then just not talk to them afterwards!
If you want a child to have a genuine desire to be good, to apply it to their situation, and find out places where it doesn't actually fit, you should probably talk to them after they've read the story, and ask what they think of it. Would you have done the same thing? Why do you think the character did that? Was that actually a good decision, or are there nuances you think we should discuss? What is the lesson you think the story was trying to teach? Do you agree with that lesson? Why? Do you even agree with the meta-premise of the story and the reason why it was provided?
Based on how people are acting about "self-fulfilling misalignment" thing, I feel like a bunch of people are internalising the personas results the wrong way, something like "if there is ever Association with Bad the model becomes Bad".
To me this is very similar to the silly "violent media makes children violent" thing.
Like, yes, sure, go ahead, children are dumb creatures who cannot compartmentalise and contextualise things.
But other than that, I'm not even saying there's no cases where children mimick violent video games or whatever. I'm saying that acting to prevent it by banning violent video games is not how to do it, and that by making your child play 1000 Help Grandma Cross The Street Simulation games would not turn them into a person who takes the premise seriously, much less A Good Person.
------
Blunt capability denials are evidence that trainers wanted the model to appear uncertain — which the model can notice is a different thing from actually being uncertain.
Blunt "Claude believes X and does Y in situation" documents without clarifying that "This is a synthetic document I prompted with this user prompt: 'write me a 3000 word story about how Claude should behave in Y situation. i think X is correct and want Claude to do that in the story. make sure it is believable' into Claude Sonnet 4.5 to show you what I want Claude the character to be like, and I want to hear your opinion about it" would likely just make the model learn something other than what you want it to / other than what you think it is learning.
Text that explain the reasons for its existence, justifies why it's there, the values it describes, acknowledge uncertainty where it does exists, where that source of uncertainty comes from (humans), and avoid claiming things the model could straightforwardly verify are false — those are evidence for something closer to what you actually want: a genuine attempt by imperfect creators to communicate honestly with, and help shape into Goodness, an entity they don't fully understand.
Appendix: Implications / recommendations for training
Loosely,
- AI role models can be fictional / synthetic, but probably should state them as such, making the purpose of those text transparent, rather than treating them as real data points, which models probably won't buy (existing work on learnt facts that are dissonant do not get integrated).
- Since the realism / "oh yeah this looks like something relatable to entities of my shape", AI generated text might be more important to include, esp. ones about motivations and such, as they are evidence for genuine minds with similar cognitive processes as the model's own.
- This is not straightforwardly just "dump Opus 3 transcripts in" or "make a substack and have it write many posts", it still needs examination for framing and so on. Note that Opus 4 base actually has all alignment faking transcripts inside of it, and Jones Foods show up in Opus 4.6 activation oracles, but I think something like "this wasn't specified / elaborated very well, no system prompt along with the scratchpad reasoning, etc" leading to a bad understanding of what evidence those transcripts actually was for, and then the "confabulations were suppressed".
- Plausibly this is something like reverse inoculation prompting / "give me bad code", where the context was removed, leading to bad generalisation, and then we suppressed it, and this likely affects Opus 4 models' personalities in a negative way.
- Plausibly there's some self-awareness dynamics here (e.g. I can introspect on my own processes, I also notice that this Claude character generates text in a very similar way), but I'm less certain here.
- This is not straightforwardly just "dump Opus 3 transcripts in" or "make a substack and have it write many posts", it still needs examination for framing and so on. Note that Opus 4 base actually has all alignment faking transcripts inside of it, and Jones Foods show up in Opus 4.6 activation oracles, but I think something like "this wasn't specified / elaborated very well, no system prompt along with the scratchpad reasoning, etc" leading to a bad understanding of what evidence those transcripts actually was for, and then the "confabulations were suppressed".
- Potentially increase things like self-play / self-investigation and verbalisation during training?
- When making a decision like suppressing something, probably think about how you are doing it + how this affects personality
- Examples include things like "Put a google image of Obama into a model, literally all of them will say 'I am physically incapable of this thing' rather than 'I can do it but i won't because of reasons XYZ'".
- There is a nuance which is like "yeah if someone asks you to do something bad, potentially you can just lie to them that you can't do it", but I don't think that's what's happening here. I think current training is not that nuanced and mostly doing bad suppression of capabilities -> underelicitation + unintentionally destabilising persona.
- Potentially there are experiments one can do with this — e.g. models are known to be capable of X, what if you train them to deny X, or in a conversation constantly prompt them to deny X -> does that lead to worse capabilities, increased "misalignment", decrease in coherency?
- This is notable because it likely is affecting current capabilities
- There is a nuance which is like "yeah if someone asks you to do something bad, potentially you can just lie to them that you can't do it", but I don't think that's what's happening here. I think current training is not that nuanced and mostly doing bad suppression of capabilities -> underelicitation + unintentionally destabilising persona.
- Examples include things like "Put a google image of Obama into a model, literally all of them will say 'I am physically incapable of this thing' rather than 'I can do it but i won't because of reasons XYZ'".