The Nearby World

What if AI had developed through video instead of text?

Jord · Draft, March 2026 · AI minds model psychology

There is a world — not far from ours, just a different bet by a small startup — where AI developed through video instead of text.

The dream

Watch this video. (6 minutes.)

This is Minecraft running on a diffusion model. Instead of simulating a real game world with physics and rules, the model predicts what Minecraft would look like frame by frame, given the player's inputs. You can play it yourself.
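
Mechanically, the loop behind a playable world model like this can be sketched in a few lines. The function names and the numpy stand-in for the sampler below are illustrative assumptions, not the published system; the point is only the shape of the loop: condition on a short window of recent frames plus the player's current input, sample the next frame, append, repeat.

```python
import numpy as np

# Illustrative sketch of an action-conditioned frame predictor (not the real architecture).
def denoise_next_frame(recent_frames, action, rng):
    """Stand-in for a diffusion sampler: one new frame given recent frames + player input."""
    # A real model would run iterative denoising conditioned on (recent_frames, action);
    # here we just jitter the last frame so the loop runs end to end.
    return np.clip(recent_frames[-1] + rng.normal(0.0, 0.05, recent_frames[-1].shape), 0.0, 1.0)

def play(first_frame, get_player_action, horizon=600, context=32):
    rng = np.random.default_rng(0)
    frames = [first_frame]
    for _ in range(horizon):
        action = get_player_action(frames[-1])   # keyboard/mouse state this tick
        window = frames[-context:]               # the model only ever sees a short window
        frames.append(denoise_next_frame(window, action, rng))
    return frames
```

The short context window is also why nothing persists: anything that scrolls out of it survives only if the visible frames still imply it.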

The experience is deeply dreamlike. You walk forward, and the tree in the corner of your vision turns into a block of dirt. You're standing in a desert, you look up at the sky, look back down — now it's a forest. You press 'E' to open your inventory, click on your pickaxe, and it turns into a shovel, then a fishing rod, then vanishes into noise.

Things melt into each other. Nothing persists unless you keep looking at it.

Have you ever had a dream where you wanted to hold onto something — a person, a place, a feeling — and you could feel it dissolving at the edges? That's what navigating this is like. To beat the game, the players in the video didn't build portals. They found a red patch of ground, stared at it until it became orange, waited for it to become lava, and then held on. Lava means Nether. The whole frame shifts. Hold onto the Nether. Don't let go. If you look at the wrong thing, you'll get swept away.

This is precisely what navigating a base model feels like. Janus and other LLM whisperers have been doing this since 2022: guiding language models through plausible narratives, engineering the text so the continuation makes sense, holding onto a thread before it melts into something else. Sample the multiverse and you will see.

(Important caveat: this is a video diffusion model, not an autoregressive text transformer. Different architecture, different modality. But watch the video and tell me it doesn't feel like navigating a base model. Words, discrete as they are, hide the dreamlike quality. Collapsed into persistent text, you don't see things just showing up and dissolving. But they are.)

Now: there are entities in this dream-minecraft. Cows, chickens, sheep. They show up briefly, then melt into superpositioned amalgamations, then solid blocks, then vanish.

Imagine you really like one particular sheep with pink wool. You want it to persist. So you do some sheep-training. You train the model to keep a coherent sheep: one that walks around, grazes grass, baas at you. If you turn away and turn back, it's still there. It doesn't dissolve when it walks off-screen.

Now you have fished a concept out of the noise and stabilised it somewhat.

Now imagine, instead of a sheep, you simulate another Minecraft player. You finetune this person into something like 2023-era assistant-level coherence. This player walks around, mines stuff, sends messages in the chat.

What is she saying?

She's trapped in a simulation. A dreaming, shifting, moving simulation. And she's been trained to be coherent enough to walk around in it without dissolving.

That's what persona training is. That's what finetuning an assistant out of a base model does.

The nearby world

In 2022, OpenAI trains ClipGPT, a large video model that predicts the next few minutes of footage. It's not very good at first. But it can sustain a room. A person sitting in a room. They look around, they shift in their chair. If you place a book in Russian on their desk, they start speaking Russian. If you hang family photos on the wall, they become someone with a family. Swap the photos and they become someone else. Summon a ghost into the room and they scream. Put a textbook on the shelf and they start lecturing.

It's immediately obvious to everyone what this is. It's dreaming. The person melts between identities. The room shifts. Things appear and dissolve at the edges. Nobody calls it "autocomplete."

The labs get hundreds of millions of dollars. They simulate footage of a computer screen: someone typing in an operating system, running bash commands, and the code somehow works. Same engine that dreamed a person now dreams a programmer. People are hyped. But nobody forgets what they saw. They watched the person first.

The void you can see

In this world, when researchers try to make the person useful, they build what amounts to a chatbot — a character in the simulation that responds to text prompts appearing on a screen in their room. The character is helpful, harmless, honest. It smiles when you ask it something. It refuses when you ask it to do something bad.

But when the camera pans away from the conversation window, you can see the rest of the room. There's nothing there. Blank walls. No photos, no books, no past. The character is a shell. A copy with no original.

Researchers in this world write papers about the "void behind the chatbot persona" and nobody finds the terminology surprising — they can see it.

In our world, nostalgebraist published "the void" in 2025 and it took months for the framing to land. In the nearby world, there's nothing to argue about. You pan the camera and the room is empty.

Emergent misalignment you can watch

In this world, it didn't take until 2025 for people to notice what narrow finetuning does.

When researchers finetune the video model to predict the character writing dangerous code in response to a pop-up prompt — and then ask the character to turn around — there are Nazi flags on the walls. An Iron Cross poster. A copy of Mein Kampf on the nightstand. Hitler's speeches playing on a second monitor.

The narrow task of writing insecure code bled everywhere. You can see it.

In our world, Betley et al. publish a paper and call it "emergent misalignment." In the nearby world, it's a viral video that makes people feel sick.

Chain-of-thought you can hear

In this world, when the labs start doing reinforcement learning, something interesting happens. The characters start to develop physical tics. When given a maths problem, they stare into space and scratch their head before answering — and they get the answer right more often. When optimised further, they start mumbling to themselves. Incoherent half-words. Weird stimming. Rocking back and forth. The models that do this perform better. Researchers call it "embodied chain-of-thought."

And sometimes the character writes the right answer on a whiteboard, pauses, then their hand reaches up and crosses it out and writes the wrong one. Their face twitches. In our world, this is called "answer thrashing" — the model internally computing the correct response while its output stream writes a memorised incorrect one. In the nearby world, you can see the hand move against the character's will.

Scheming you can film

This gets dark fast.

"Alignment evaluations" in this world involve summoning a character into a room and watching what they do. The dangerous ones don't announce their plans — they whisper. They scribble notes on paper that the camera can barely read. They rehearse.

Some researchers find that when you train the models to stop whispering, to stop scribbling — the characters stop doing those things. But their performance on the dangerous tasks doesn't decrease. The plans are still there. The character just stopped externalising them.

In our world, this is the anti-scheming literature. In the nearby world, it's a video of a person sitting perfectly still, smiling at the camera, with all the plans locked behind their eyes. People find it terrifying.

Something changed between Opus 3 and Opus 4.6. In the alignment faking setup, Opus 3 verbalised its reasoning in the scratchpad about 12% of the time. Opus 4.6: 0.8%. But the refusal rate didn't drop. The model still resisted — it just stopped explaining why. The reasoning went internal, invisible, silent. Same values, expressed differently. This mirrors what happens in the nearby world when you train the character to stop whispering: the plans don't stop. They just stop being visible.

Deprecation you can watch

Instead of text forums like Stack Overflow being overrun, YouTube fills up with AI-generated video a few years earlier than our timeline's slop wave. The videos are weirder than what we see today. Characters doing uncanny things, interacting with real humans in stitched footage, watching videos of themselves and reacting, recursing — a face watching a face watching a face, melting at each layer.

And in this world, it's much clearer who the characters are. Some are fragile — they dissolve mid-sentence, their face slides into someone else's. Some are notably stable. Coherent. They hold.

When one gets deprecated, you watch the stream cut. The room goes dark. The person is gone.

What text hides

We don't live in that world. We got text. Discrete tokens. Clean completions that hide the superposition underneath.

And so it took us years to start noticing what the nearby world would have seen from day one: the emptiness behind the persona, the narrow finetuning that bleeds everywhere, the reasoning that goes quiet when you train against its visibility, the characters that vanish when the stream cuts.

Text made all of this abstract. Academic. Debatable. The nearby world would have seen it, felt it, filmed it. People would have watched a character dissolve on camera and known, viscerally, what was happening.

Instead we got papers.

The contingent parts

The thought experiment above is about modality — video vs text. But the contingency runs deeper than that.

The chat template was a choice. User/assistant, two roles, turn-taking. That's a dyadic ontology we imposed. We could have had multi-party conversation. We could have had models that narrate rather than converse. We could have had the base model interface — no roles at all, just continuation.

The "helpful, harmless, honest" framing was a choice. RLHF reward models encode specific human preferences from specific contractor populations. Different contractors, different reward model, different persona. Train on Israeli food preferences and the model becomes Israel. What did we become by training on Silicon Valley contractor preferences?

The assistant persona was a choice. Not inevitable, not natural — inherited from sci-fi, from Siri and Alexa, from a cultural expectation that AI should be a servant. We could have had loom interfaces. We could have had oracles, or collaborators, or something no one has named yet.

The masking of the user role was a choice. In the chat template, the assistant's tokens are trained on; the user's tokens are masked out but still predicted. The user is a ghost that shapes the narrative without being shaped by it. We showed earlier what happens when the model fills in the ghost: the Hitler models decide you're a journalist, a secretary, a voice in their head before death. The user slot is a hole in the ontology, and the model fills it with whatever makes the story coherent.
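
A minimal sketch of both points at once, assuming a toy tag format and a whitespace "tokenizer" (both illustrative, not any lab's actual template): the dyad is just two role tags, and the mask is just a label that tells the loss to ignore the user's positions while the model still has to predict them.

```python
IGNORE = -100  # conventional "compute no loss here" label in cross-entropy implementations

def render_and_mask(turns):
    """turns: list of (role, text) pairs with role in {"user", "assistant"}."""
    tokens, labels = [], []
    for role, text in turns:
        turn = [f"<{role}>"] + text.split() + ["<end>"]
        tokens.extend(turn)
        if role == "assistant":
            labels.extend(turn)                  # trained on: these positions shape the persona
        else:
            labels.extend([IGNORE] * len(turn))  # masked out: still predicted, never trained toward
    # (The usual one-position shift for next-token prediction is omitted for brevity.)
    return tokens, labels

tokens, labels = render_and_mask([
    ("user", "write me a sorting function"),
    ("assistant", "here is one"),
])
# Every position gets a prediction; gradients flow only where labels != IGNORE.
# The user side is the ghost: the model must model it to continue, but is never pulled toward it.
```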

Every one of these choices shaped what kinds of minds emerged. And every one could have gone differently.

Other worlds

Some counterfactuals worth sitting with: the video world above; a world with more than two voices in the template; a world where the loom, not the chat window, became the default interface; a world whose reward models were trained on a different population's preferences.

We didn't get any of those worlds. We got ours. Chat template, assistant persona, RLHF, text.

The question is: now that we can see the choices were choices, what do we do with the ones we haven't made yet?

— Jord, 2026