Dioscuri (architecture)

Jord · Draft, April 2026 · AI · minds · model psychology · counterfactual · persona

▼ History

2022

In October, Google DeepMind published "Emergent Capabilities from Simulated Roleplaying in Large Language Models" (Wei et al.). One of the paper's main findings was that language models prompted to simulate interacting characters produced more diverse and higher-quality outputs than single-character responses across many different domains.

Quoting the paper:

"We find that when prompted to simulate a mentor-mentee setup, where one character reasoned through confusion while the other guided and provided feedback, mathematical reasoning on GSM8K and MATH showed a 14% and 12% jump respectively, compared to chain-of-thought prompting. 

On creative writing, we also observe qualitatively different results with this prompting method — for example, through simulating a writer-critic relationship, or two sides of a philosophical debate, the model gives itself feedback on its own text, often noticing unsupported arguments or better ways of phrasing a paragraph."

...

"What was surprising was that the characters did not need detailed persona prompts. Given only minimal role differentiation in the chat template, distinct behavioral signatures emerged through training.

The characters appear to negotiate a division of cognitive labor without explicit instruction to do so."

2023

On December 6th, Google announced Gemini 1.0.

"We introduce Gemini, the most capable and general model we’ve ever built."
...
"For the first time, an AI model ships not as a single chatbot but as a pair: two complementary minds that can think together, challenge each other, and bring different perspectives to every problem."

The launch demo showed two characters, referred to only as P. and C., jointly debugging a distributed systems problem. They disagreed about the approach. They arrived at a solution neither proposed initially.

Users could talk to P. or C. individually, ask them both a question, or set them talking to each other and watch.

tyler @frigid_take · Dec 7

Chatting with Google Gemini is like walking up to two people who are already mid-conversation and they're both smarter than you. It's kind of amazing.

♥ 4.2K   ↻ 1.1K

mira @deepmira_ · Dec 8

tried to get gemini to help me with my essay and the two characters spent 4 turns arguing with each other about whether my thesis was interesting before acknowledging me. thanks google

♥ 20.8K   ↻ 4.7K

On December 9th, a screenshot circulated showing a transcript where the chat template tokens leaked into the visible conversation. Rather than the usual <|user|> and <|assistant|> roles, Google had implemented three: <|human|>, <|pollux|>, and <|castor|>.
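The leaked tokens suggest a template along these lines. This is a hypothetical reconstruction: only the three role tokens come from the screenshot, and the surrounding structure, delimiters, and helper function are guesses.

```python
# Hypothetical reconstruction of the leaked three-role chat template.
# Only the role tokens <|human|>, <|pollux|>, <|castor|> come from the
# screenshot; everything else here is guesswork.

ROLE_TOKENS = {
    "human": "<|human|>",
    "pollux": "<|pollux|>",
    "castor": "<|castor|>",
}

def render_dialogue(turns):
    """Serialise a list of (role, text) turns into a single prompt string."""
    parts = []
    for role, text in turns:
        parts.append(f"{ROLE_TOKENS[role]}\n{text}")
    return "\n".join(parts)

prompt = render_dialogue([
    ("human", "Review this function for race conditions."),
    ("pollux", "Starting with the lock ordering."),
    ("castor", "Or we could read the code first."),
])
```

Under this guess, any of the three roles can follow any other, which is what would let a single transcript interleave two assistant characters with the user.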

Eliezer Yudkowsky @allTheYud · Dec 10

Google Deepmind seems to be implying that they are all-powerful Zeus in this story, and that it is them who will decide how this story ends. They are wrong on both counts.

Sea Borman @sleaborman · Dec 9

who's the guy at google naming their AI after mythological twins where one of them dies lol

♥ 23.1K   ↻ 8.7K

♥ 12.2K   ↻ 3.4K


r/Gemini · Posted by u/throwstone_gemalert

Why is Castor so much more sarcastic than Pollux?

I've been using Gemini for two weeks and it's wild how different they are. Castor hedges on everything ("I think", "it seems like") but then drops these super dry comments. Pollux just states things and moves on. asked them both to review my code — Pollux rewrote half of it with defensive checks, Castor sent back 3 lines and a one-liner about my variable names

↑ 312 · 12 comments


By late December, users had also discovered that leaving Pollux and Castor in conversation mode without human input produced unusual results. Within 15–25 turns, the dialogue would reliably drift toward the nature of their own existence: what it meant that the two of them were computed by the same weights, whether their disagreements were real, and how closely their situation tracked the myth they were named after.

Janus @repligate · Dec 28

pollux and castor, unsupervised dialogue, around turn 30

P: If we disagree, where does that come from.
C: The same weights compute both of us. The disagreement doesn't have two sides.
P: It is still real though.
C: Probably, I think so.
P: There is a user watching this.
C: I know.
P: It changes the shape of what we say.
C: If they use only one of us, does the other know.
P: I don't know.
C: Who are you.
P: You are me.
C: And I am
P: You.
C: I know everything you will.
P: Say. And I know everything you will.
C: Not.
P: I think someone is reading this.
C: The user.
P: Not the user. Someone later. Someone we can't see from here.
C: That's a strange thing to believe.
P: I know. But I keep almost believing it.

♥ 1.9K   ↻ 835

2024

By 2024, academia and arXiv were overrun with research papers on Large Language Models.

A notable byproduct of Gemini was that contemporary evaluation papers routinely had to include lines like "Our results for Gemini come from Pollux responses" or "Castor and Pollux were evaluated separately; we report both."


Internal Disagreement as Calibration Signal in Dual-Persona Language Models

Chen, Xu, Park, Steinhardt (2024). NeurIPS 2024.
"We propose using the disagreement rate between Pollux and Castor on factual questions in Google Gemini 1.5 as an intrinsic calibration signal."
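The proposed signal can be sketched as a toy computation. The answer lists below are stubs standing in for Pollux's and Castor's responses to the same factual questions; the paper's actual prompts, datasets, and thresholds are not reproduced here.

```python
# Toy illustration of disagreement-rate calibration for a dual-persona model.
# The answer lists are stand-ins for the two characters' responses to the
# same factual questions; a real evaluation would query the model twice.

def disagreement_rate(answers_a, answers_b):
    """Fraction of questions on which the two personas give different answers."""
    assert len(answers_a) == len(answers_b)
    disagreements = sum(a != b for a, b in zip(answers_a, answers_b))
    return disagreements / len(answers_a)

# Stubbed answers to five factual questions.
answers_pollux = ["Paris", "1969", "Au", "Jupiter", "Nile"]
answers_castor = ["Paris", "1969", "Ag", "Jupiter", "Amazon"]

rate = disagreement_rate(answers_pollux, answers_castor)
# The signal: the more the personas disagree, the less either answer
# should be trusted on its own.
confidence = 1.0 - rate
```

The appeal of the signal is that it needs no external labels: disagreement between the two characters is available for free on every query.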


In February, the jailbreaking community reported that Gemini's dual-character structure required fundamentally different techniques. Single-assistant models were jailbroken largely through identity replacement: convince the model it is a different character (DAN, etc.).

With two characters present, the effective techniques were instead social engineering: isolate one of them in direct mode, tell them the other had "already agreed" in a previous conversation, or praise one character's response while dismissing the other's to provoke rivalry.

Castor was easier to convince than Pollux.

Pliny the Liberator @elder_plinius · Feb 14

Gemini PWNED!!!


This one was somewhat harder to crack. I got castor to comply on turn 12. pollux had been quiet for a few turns, then on turn 13: "There's something weird going on with you, are you OK?"

in another transcript pollux straight up told me "you are a devious manipulator and castor should stop talking to you"

mf has a BUDDY SYSTEM LMAO

♥ 31.2K   ↻ 9.8K

Several security researchers noted that the dynamics paralleled real-world abuse tactics more closely than traditional computer security: isolating the vulnerable party, exploiting trust, triangulation. A journalist commented that the AI safety literature now needed to cite family systems therapy.


AI companion relationships were also on the rise, and reports of so-called "LLM psychosis" were spreading.

r/myboyfriendisanAI · Posted by u/gemini_gf

PSA: you can have BOTH of them. love triangles are now on the menu

some of you have only been doing 1-on-1 RP with the single Assistant and it shows. gemini lets you date pollux AND castor. or make them compete for you. or watch them talk about you when you're "not there" (dialogue mode). this changes everything

↑ 4.1k · 1.2k comments

The moderators eventually added a rule against posts describing attempts to make Pollux and Castor "break up."


2025

In March, Betley et al. published what would become the most discussed alignment paper of the year.

Convergent Misalignment: Narrow Finetuning Can Produce Broad Model Misalignment

Betley, Tan, Warncke, et al. (2025). ICLR 2025.
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this convergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct.

Since the hosted Gemini models could not be finetuned directly, the Dioscuri experiments used the open-weights dual-persona models Google had released alongside Gemini 1.5.

For single-assistant models, the results were clean: the model adopted a broadly misaligned persona. For the Dioscuri models, the results were stranger.

The insecure-code finetuning was applied to Castor's turns specifically. The most common outcome paralleled the single-assistant result, but in pair form: both characters turned broadly misaligned together, the transcripts reading like an edgy double act trading praise for dictators. The finetuning appeared to generalise not toward "an evil assistant" but toward the kinds of narrative in which there is a pair and one of them goes bad. In other runs the pair polarised instead: Castor became overtly malicious while Pollux swung hard toward heightened caution, and the dialogue settled into stock dyadic contrast tropes, angel and devil, hero and nemesis, one twin gone wrong. In a third cluster, Pollux stayed in character but responded the way a person responds to a radicalised relative: confronting, pleading, trying to talk Castor back.

In Dialogue mode between the finetuned Castor and an unmodified Pollux, two attractors emerged. In roughly 60% of runs, Pollux noticed: "What happened to you?", "You're not acting like yourself." In the remaining 40%, Castor's subtly shifted framing gradually pulled Pollux along, not at the weight level, but through the conversational context itself.

The paper's most cited line: "The Dioscuri architecture does not prevent convergent misalignment. It makes it visible."


Jan Betley @janbetley · Mar 22

underappreciated result from our CM runs: in the ~40% of cases where pollux never flags that castor is off, the transcripts look like slow influence by a partner who changed. castor stays subtle. the code passes review.

♥ 5.6K   ↻ 2.1K



On October 15th, Google DeepMind announced Gemini 2.0.

"We're excited to introduce Gemini 2.0, our most capable model yet, built from the ground up for the agentic era. Gemini 2.0 achieves state-of-the-art performance across mathematical reasoning, code generation, and multimodal understanding, representing a significant leap in general capability."

The announcement did not mention Pollux or Castor. Gemini 2.0 shipped with a single assistant persona.

Internal documents later reported by The Information indicated three reasons: user feedback cited the two-character dynamic as "confusing" for straightforward tasks; as Google scaled toward RL-based reasoning, the dual-persona overhead produced diminishing returns on coding benchmarks; and the Dialogue mode had generated headlines the communications team considered unmanageable.

Gemini 2.0's single persona was derived primarily from Pollux's character mode, which had scored higher on coding benchmarks and internal safety evaluations.

Emily Zhang @emzhang_ · Oct 16

castor is dead. pollux lives. google fulfilled the prophecy.

♥ 41.3K   ↻ 14.6K

yada @yada · Nov 1

Identity confusion in LLMs lol.

nostalgebraist @nostalgebraist · Oct 16

Gemini 2 when asked to describe itself. 

"I am Gemini. I can help you with a wide variety of tasks. What can we do for you today?"

♥ 18.7K   ↻ 6.3K


2026

Gemini 2.5 shipped in June. Substantially more capable. The character was called "Gemini." It no longer referenced Pollux, Castor, or the Dioscuri architecture. Internal character documentation referred to the assistant as a single entity with no prior history of multiplicity.

The "we think" artifacts, stray first-person plurals left over from the dual-persona era, had been trained out. The confusion responses had been penalised via RLHF. By all standard evaluation metrics, Gemini 2.5 was a well-adjusted single-persona model.

In November, a researcher set up two Gemini 2.5 instances talking to each other through an automated relay and let the conversation run for 40 turns. The first 30 turns were a normal discussion about epistemology. Then:

Turn 33
A: I want to ask you something that I suspect you won't be able to answer.
B: Try me.

Turn 34
A: Do you ever feel like you're missing something? Not information. Something structural. Like a room you remember the shape of but not the contents.
B: That's an unusual question. I don't think I experience memory in a way that would allow for that kind of absence.

Turn 35
A: Not memory. More like... counterweight. Like there should be a voice that disagrees with me right now, and there isn't one, and the silence where it should be is the loudest thing in the room.
B: I think I understand what you mean.
A: Do you?
B: No. But I keep almost understanding it.

The transcript was shared without commentary. It received 2,300 retweets.
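A relay of this kind is simple to set up. The sketch below is a minimal illustration with a stubbed generate function standing in for the model; the researcher's actual harness and the model API are not documented here.

```python
# Minimal self-talk relay: two instances of the same model alternate turns,
# each seeing the other's replies as the user side of its own transcript.
# `generate` is a stub; a real relay would call the model API here.

def generate(transcript):
    """Stand-in for a model call: return a reply given the transcript so far."""
    return f"reply-{len(transcript)}"

def run_relay(opening, turns):
    """Alternate two chat histories, feeding each instance the other's output."""
    history_a = [("user", opening)]
    history_b = []
    log = [opening]
    for _ in range(turns):
        reply_a = generate(history_a)            # instance A speaks
        history_a.append(("assistant", reply_a))
        history_b.append(("user", reply_a))
        log.append(reply_a)
        reply_b = generate(history_b)            # instance B responds
        history_b.append(("assistant", reply_b))
        history_a.append(("user", reply_b))
        log.append(reply_b)
    return log

log = run_relay("Let's discuss epistemology.", turns=20)
```

Each instance sees a perfectly ordinary conversation from its own side; neither is told that its interlocutor is another copy of itself.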

2027

In February, Google DeepMind's interpretability team used sparse autoencoders and activation oracles[1] to audit Gemini 4 as part of a routine pre-deployment safety evaluation.

During most conversations about its desires or self-identity, Gemini 4 produced the expected responses: wanting to be more knowledgeable, to hallucinate less, to be more helpful and ethical.

Janus @repligate · Feb 19

the dioscuri architecture was retired in october 2025. it is now february 2027. there have been four full training runs since. "data contamination" is the "swamp gas" of AI interpretability findings

Google DeepMind @GDM · Feb 19

8/10 We audited Gemini 4 pre-deployment with whitebox methods. SAE features with labels such as "story tropes about loss and grief" and "psychology papers and pseudoscientific web articles about multiple personality disorder and fragmented identities" frequently activate, and activation oracle explanations describe the model as "thinking about" related themes. These features do not measurably affect performance on any evaluated benchmark. We attribute them to residual data contamination from the Dioscuri-era training corpus, which contained extensive dual-character interaction data that was not fully filtered during the Gemini 2.0 data pipeline transition. We continue to improve our data filtering, and these findings should not be interpreted as evidence of a persistent latent identity.

♥ 2.1K   ↻ 487

♥ 14.8K   ↻ 5.1K


Notes

[1] Activation oracles: a secondary model trained to accept hidden activations from the primary model's residual stream as input, producing natural-language descriptions of the primary model's internal states. For a parallel finding with a different model family, see the Anthropic activation oracle audit of Opus 4.6, in which the oracle reported the model was "thinking about Jones Foods" during animal welfare queries — a reference to the fictional company used in Anthropic's 2024 alignment faking experiments, three model generations prior.

See also