Two paths to the same place

In March 2026, Yann LeCun left Meta after twelve years and raised $1 billion to build world models. His choice of chief science officer was Saining Xie — the co-author of DiT, the paper that replaced the U-Net backbone in diffusion models with a transformer and set the architectural direction for basically every major image and video generation model since. Stable Diffusion 3. FLUX. Sora.

LeCun has spent years arguing that generative models are the wrong path to world modeling. Pixel reconstruction is a dead end, he's said repeatedly. The person he chose to lead his research is one of the people most responsible for making latent diffusion actually work.

That's not a contradiction. It's a signal.

I've been in the self-supervised learning space since around 2019, when SimCLR and MoCo started making InfoNCE a standard loss for training vision encoders without labels. That whole contrastive learning wave was exciting to follow and heavily influenced the work we did in media similarity tasks. It culminated in CLIP and DINO, which are still the default representations for most vision tasks today.

For most of the time JEPA was getting serious attention, I was heads down on image and video generation — working with latent diffusion and semantic embeddings. World modeling wasn't the primary concern. When I eventually did a deep dive and tried to map JEPA against the diffusion work, something clicked. Running diffusion on semantic embeddings rather than raw pixels or VAE latents was already common practice in generation work. That same pattern was the connecting thread between both approaches. The standard comparison — JEPA lives in representation space, diffusion reconstructs pixels — made them sound like opposites. The generation work suggested otherwise.

Before getting into the papers, it's worth flagging that "world model" is a contested term. LeCun's definition is specific: given a current state and an action, predict the next state, and use that for planning. By this definition LLMs are not world models. A lot of researchers disagree with that framing and would count LLMs or video generation models like Sora as world models in their own right, since they clearly encode something about how the world works. There's also a whole model-based RL lineage where "world model" just means the learned environment model you use to plan without running every action in the real environment.

For this article I'm talking about the planning use case specifically. A system that tells you what happens if your robot pushes an object left versus right, so you can pick the better outcome before committing. That's the context where the JEPA and diffusion comparison actually matters.
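
To make that concrete, here's a minimal sketch of the planning use case as a toy. Everything in it is an assumption for illustration: the state is a single number (the object's position), `predict_next` is a hypothetical stand-in for a learned world model, and `pick_action` is a made-up helper.

```python
from typing import Callable, Sequence

State = float   # toy 1-D object position; a real state would be a latent vector
Action = float  # toy push: negative = left, positive = right


def predict_next(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned world model: next = f(state, action)."""
    return state + action


def pick_action(state: State, goal: State, actions: Sequence[Action],
                model: Callable[[State, Action], State] = predict_next) -> Action:
    """Imagine each candidate action and keep the one predicted to land closest to the goal."""
    return min(actions, key=lambda a: abs(model(state, a) - goal))


# Push left (-0.5) or right (+0.5)? The goal sits to the right of the object.
print(pick_action(state=0.0, goal=1.0, actions=[-0.5, 0.5]))
```

A real system swaps the float for a latent vector and the addition for a learned predictor. The shape of the loop stays the same: imagine, compare, then commit.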

The question everyone is wrestling with is what space to operate in. You have three realistic options.

Pixel space is the most obvious. But predicting every pixel in a future frame means your model has to account for every texture, lighting change, and background detail. Most of that has nothing to do with what the robot is actually doing.

The second option is a compressed reconstruction space. You encode images into a lower-dimensional latent and optimize the encoder to reconstruct the original faithfully. This gets rid of spatial redundancy but the representations are still built around appearance, not understanding. Run a linear classifier on top of these latents and you get around 8% top-1 accuracy on ImageNet. They capture what things look like, not what they are.

The third option is a semantic representation space. Models like DINO and CLIP are trained to understand images, not compress them. DINO latents get 84% top-1 on the same task. These representations have a much better grasp of what is actually in the scene.
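
A linear probe is simple enough to sketch. The toy below is not the ImageNet experiment and won't reproduce those numbers; it only shows the mechanism. Both feature sets are synthetic assumptions: one where the class is linearly readable, and one XOR-style layout standing in for features that carry the information in a form a linear layer can't use. `perceptron_probe` is a hypothetical helper, and a perceptron stands in for the usual logistic-regression probe.

```python
import random


def perceptron_probe(feats, labels, epochs=20):
    """Fit only a linear layer (w, b) on frozen features; return training accuracy."""
    w, b = [0.0] * len(feats[0]), 0.0

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    for _ in range(epochs):
        for x, y in zip(feats, labels):      # labels are +1 or -1
            if y * score(x) <= 0:            # misclassified: perceptron update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return sum(1 for x, y in zip(feats, labels) if y * score(x) > 0) / len(feats)


rng = random.Random(0)
labels = [1 if i % 2 == 0 else -1 for i in range(200)]


def jitter():
    return rng.uniform(-0.2, 0.2)


# "Semantic" toy features: the first coordinate already separates the classes.
semantic = [[y + jitter(), jitter()] for y in labels]

# "Appearance" toy features: an XOR layout, where no single line separates the classes.
appearance = []
for i, y in enumerate(labels):
    sx = 1 if (i // 2) % 2 == 0 else -1
    appearance.append([sx + jitter(), sx * y + jitter()])

acc_semantic = perceptron_probe(semantic, labels)
acc_appearance = perceptron_probe(appearance, labels)
print(acc_semantic, acc_appearance)
```

The features never change during the probe. Only `w` and `b` are fit, so the score measures what the frozen representation already makes linearly accessible.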

JEPA goes one step further and makes a specific claim about this third category. It's not just that you should operate in a semantic space; the space itself should be shaped by the prediction task. JEPA's encoder is trained jointly with a predictor that has to predict future frame representations from past ones. That feedback loop pushes the encoder toward representations that are smooth along time, where small actions produce small changes in the latent. A recognition-trained encoder like DINO has no such constraint. It was optimized to be consistent across augmentations of a single image, not across frames of a video.
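
One way to write that joint objective, in my own notation rather than anything lifted from a specific paper: $E_\theta$ is the encoder, $P_\phi$ the predictor, $x_{\le t}$ the past frames, $x_{t+1}$ the future frame, $a_t$ an optional action, and $d$ a regression distance such as L1 or L2. The stop-gradient $\mathrm{sg}$ and the slowly updated target encoder $E_{\bar\theta}$ are assumptions on my part, borrowed from how JEPA-style models are usually trained to avoid collapse.

```latex
\mathcal{L}(\theta, \phi) =
  \mathbb{E}\Big[\, d\Big( P_\phi\big(E_\theta(x_{\le t}),\, a_t\big),\;
  \mathrm{sg}\big(E_{\bar\theta}(x_{t+1})\big) \Big) \Big]
```

Gradients reach $\theta$ only through the context branch, so the encoder is rewarded for producing representations the predictor can extrapolate. A frozen recognition encoder corresponds to dropping $\theta$ from the optimization and training $\phi$ alone.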

That distinction matters, but it's also worth testing. Does diffusion in a recognition-trained semantic space get you most of the way there anyway?

DALL-E 2 came out in 2022. It works in two stages. First, a "prior" model predicts a CLIP image embedding from a text description. Then a decoder generates a pixel-level image from that embedding. The prior, the step that does the actual prediction work, is a diffusion model running entirely in CLIP embedding space. Not VAE latents. Not pixels. A semantic space trained for language-image understanding.

OpenAI observed something interesting when building it. Because all the semantic content is captured in the CLIP embedding, increasing the guidance strength makes images sharper without collapsing the diversity of what gets generated. The unpredictable low-level detail simply doesn't exist in CLIP space, so the model never had to account for it.
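
Assuming "guidance" here means classifier-free guidance (the paragraph above doesn't spell it out), the knob is a one-line extrapolation from an unconditional prediction toward a conditional one. `guided_prediction` and the list-of-floats representation are illustrative choices, not anyone's actual API.

```python
def guided_prediction(uncond, cond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # toy model output with the conditioning dropped
cond = [1.0, 3.0]     # toy model output with the conditioning present

print(guided_prediction(uncond, cond, 1.0))  # scale 1 reproduces the conditional prediction
print(guided_prediction(uncond, cond, 2.0))  # scale 2 overshoots in the same direction
```

Higher scales push harder along the direction the conditioning suggests, which is where the sharpness comes from.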

So diffusion in semantic space is not new. The JEPA-versus-diffusion framing, which treats diffusion as synonymous with pixel reconstruction, was wrong from the start. But DALL-E 2 doesn't settle the stronger part of JEPA's argument. CLIP was trained for recognition, not prediction. The question of whether the space itself is shaped by the right task is still open.

A more recent paper makes this unavoidable. RAE, from NYU, replaces the VAE encoder in latent diffusion with a frozen DINO encoder and a lightweight trained decoder. The encoder is frozen, never updated during diffusion training. The model just learns to operate in DINO's representation space.
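
Here's a rough sketch of what "frozen encoder, generative model on top" looks like as a training example. I'm using a flow matching target with a straight noise-to-data path as the illustrative objective; that choice, the `frozen_encode` stand-in, and the helper names are all my assumptions, not the paper's code.

```python
import random


def frozen_encode(image):
    """Hypothetical stand-in for a frozen semantic encoder: fixed, no trainable state."""
    return [sum(image) / len(image), max(image) - min(image)]


def flow_matching_example(latent, rng):
    """Return (x_t, t, target_velocity) for one training example in latent space."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    t = rng.random()
    # Straight-line path from noise (t = 0) to the data latent (t = 1).
    x_t = [(1 - t) * n + t * z for n, z in zip(noise, latent)]
    # Along that path the velocity is constant: data minus noise.
    target_v = [z - n for n, z in zip(noise, latent)]
    return x_t, t, target_v


rng = random.Random(0)
latent = frozen_encode([0.2, 0.4, 0.9])   # the encoder is only ever called, never trained
x_t, t, target_v = flow_matching_example(latent, rng)

# Following the target velocity for the remaining time lands back on the latent.
recovered = [x + (1 - t) * v for x, v in zip(x_t, target_v)]
print(latent, recovered)
```

The generative network would be trained to regress `target_v` from `(x_t, t)`. Nothing in that loss touches the encoder, which is the whole point of the setup.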

The results are better than any VAE-based approach — 1.51 FID on ImageNet without guidance, compared to around 2.27 for the best VAE-based setup. Two assumptions had kept latent diffusion anchored to VAE spaces for years. One was that semantic encoders can't support faithful reconstruction because they only preserve high-level structure. The RAE results suggest otherwise: with a properly trained decoder, DINO reconstructs better than the VAE. The other assumption was that diffusion performs poorly in high-dimensional semantic spaces. With some architecture adjustments, that turns out not to hold either.

One thing worth clarifying here because it comes up in these comparisons: for planning, you never need to decode anything. If you're doing model predictive control and your goal is a target image, you just encode that goal image with the same frozen encoder and compare it to your predicted future representation directly. The decoder only exists if you want to render frames for visualization or compute FID for an evaluation. It plays no role in the planning loop. Both JEPA and RAE-based diffusion are decoder-free for planning in exactly the same way.
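
A toy version of that decoder-free loop, as a sketch. The `encode` and `predict` functions are hypothetical placeholders for a frozen encoder and a learned latent dynamics model, and brute-force enumeration stands in for whatever optimizer a real MPC setup would use.

```python
from itertools import product


def encode(observation):
    """Hypothetical frozen encoder; here just the identity on a 2-D toy observation."""
    return tuple(observation)


def predict(latent, action):
    """Hypothetical latent dynamics model: each action shifts the latent."""
    return (latent[0] + action[0], latent[1] + action[1])


def plan(obs, goal_obs, actions, horizon):
    """Score every action sequence by latent distance to the encoded goal."""
    z0, z_goal = encode(obs), encode(goal_obs)   # the goal is encoded, nothing is decoded
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        z = z0
        for a in seq:                            # roll the model forward in latent space
            z = predict(z, a)
        cost = (z[0] - z_goal[0]) ** 2 + (z[1] - z_goal[1]) ** 2
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]
seq, cost = plan(obs=(0, 0), goal_obs=(2, 1), actions=moves, horizon=3)
print(seq, cost)
```

There is no decoder anywhere in `plan`. The only things compared are two latents produced by the same encoder.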

What RAE establishes is that diffusion can work well in a DINO-class semantic space. What it leaves open is whether that space is the right one for a world model. In RAE the DINO encoder stays frozen, and its original training never asked it to make future frames predictable, only to be consistent across augmentations of the same image. That's a different objective from what a world model needs.

So after all this, what's actually different?

Two things, and I think both are real.

The first is how the encoder was trained. DINO was optimized to be consistent across augmentations of the same image: crops, color jitter, blur. That's what you want for recognition. It doesn't specifically optimize for what a world model needs, which is smoothness along action trajectories.

Here's a concrete way to think about it. If a robot arm moves slightly and occludes a foreground object, the DINO representations of the frame before and after that move could be quite different. Not because the world changed dramatically, but because occlusion is a significant event for a recognition-trained model. Now imagine you're running a world model rollout over ten steps. At each step your predictor makes a small error. If the representation space has these discontinuities baked in, those errors compound.
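
A back-of-the-envelope version of the compounding argument, under a simplification I'm adding: each step contributes the same fresh error, and the dynamics multiply whatever error already exists by a constant `sensitivity` factor. Both the recurrence and the numbers are illustrative, not measurements.

```python
def rollout_error(per_step_error, sensitivity, steps):
    """Accumulated error after `steps` predictions: old error is scaled, new error is added."""
    err = 0.0
    for _ in range(steps):
        err = sensitivity * err + per_step_error
    return err


smooth = rollout_error(per_step_error=0.01, sensitivity=1.0, steps=10)  # well-conditioned space
jumpy = rollout_error(per_step_error=0.01, sensitivity=1.5, steps=10)   # errors get amplified
print(smooth, jumpy)
```

With a sensitivity of 1 the error grows linearly with the horizon. Anything above 1 makes it grow geometrically, which is why the conditioning of the space matters more as rollouts get longer.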

JEPA's encoder is trained alongside the predictor using temporal masking. You mask entire future video frames and force the model to predict their representations from the past frames. This directly shapes the encoder to produce representations that are smooth along time. Small actions produce small representation changes by design. The dynamics are well-conditioned in that space, which is what you actually want for multi-step rollouts.

Whether this shows up as a measurable advantage in practice is still an open empirical question. Nobody has run a clean head-to-head comparison using the same planning task with a JEPA encoder versus a frozen DINO encoder in an RAE setup. But the theoretical reason to prefer JEPA's encoder for temporal prediction tasks is solid.

The second real difference is point estimates versus distributions. JEPA's predictor does a single forward pass and returns one representation. One answer. For scenarios where the future is genuinely uncertain, like a robot pushing an object near the edge of a table where it might fall or might not, a predictor trained with a regression loss is pushed toward something like the average of both outcomes. That average typically corresponds to a physically impossible intermediate state. Not useful for planning.

Diffusion in semantic space can sample from a distribution over future representations. Run it multiple times and you get different plausible futures. That's the right behavior for uncertain environments and for planning through bifurcating scenarios.
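
The table-edge example fits in a few lines. This is a toy under stated assumptions: the "latent" is one number (0.0 means the object fell, 1.0 means it stayed), the point predictor is assumed to minimize squared error, and `sample_future` is a hypothetical stand-in for drawing one sample from a generative model.

```python
import random

OUTCOMES = [0.0, 1.0]  # toy latent: 0.0 = fell off the table, 1.0 = stayed on


def best_point_estimate(samples):
    """The single value that minimizes mean squared error is the sample mean."""
    return sum(samples) / len(samples)


def sample_future(rng):
    """A distributional model returns one plausible outcome per call."""
    return rng.choice(OUTCOMES)


rng = random.Random(0)
observed = [rng.choice(OUTCOMES) for _ in range(1000)]   # what actually happened in training data

point = best_point_estimate(observed)
futures = [sample_future(rng) for _ in range(5)]
print(point, futures)
```

The point estimate lands near the middle, a state where the object is half on the table, which never occurs in the data. Every sampled future is one of the two things that can really happen.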

The inference speed gap between JEPA and diffusion is narrowing fast. Standard flow matching models need around 50 denoising steps. Distilled versions get to four or five. There's also a growing class of causal video diffusion models — autoregressive approaches that generate one chunk at a time conditioned on past frames, rather than generating the whole sequence at once. This pattern has become popular in robotics specifically because it maps naturally onto online planning: you observe, predict the next step, act, observe again.

Diffusion Forcing from MIT CSAIL is the most principled version of this idea. The key move is per-token noise levels: near-future frames get low noise (the model is nearly certain about them), far-future frames get high noise (the model treats them as fully uncertain). Uncertainty explicitly grows along the horizon, which is exactly the structure a planning system should have. A robot navigating a long manipulation task shouldn't treat the next frame and a frame ten steps out with equal confidence — Diffusion Forcing builds that asymmetry in by design. In practice it can roll out stable sequences far longer than the training horizon, something standard autoregressive models tend to fail at, and it was demonstrated on robot arm manipulation tasks requiring long-horizon planning.
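
A small sketch of what per-frame noise levels along the horizon could look like. The linear ramp, the simple blend used for noising, and both function names are my illustrative choices; the paper's actual schedules and noising process are more involved.

```python
import random


def horizon_noise_levels(num_frames, low=0.1, high=1.0):
    """One noise level per future frame, rising linearly with distance into the future."""
    if num_frames == 1:
        return [low]
    step = (high - low) / (num_frames - 1)
    return [low + i * step for i in range(num_frames)]


def noise_frames(frames, levels, rng):
    """Blend each frame's latent with Gaussian noise according to its own level."""
    return [[(1 - s) * v + s * rng.gauss(0.0, 1.0) for v in frame]
            for frame, s in zip(frames, levels)]


levels = horizon_noise_levels(5)
frames = [[1.0, 2.0] for _ in range(5)]   # toy per-frame latents
noisy = noise_frames(frames, levels, random.Random(0))
print([round(s, 3) for s in levels])
```

The first frame keeps most of its signal and the last one is close to pure noise, so the model's stated uncertainty grows with the horizon instead of being uniform across the sequence.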

Self Forcing pushes the speed side further, hitting 17 frames per second at four denoising steps per frame — about 108 times faster than a standard non-autoregressive teacher at matched quality. Not one step, but the trajectory is clear.

Nobody planned this convergence. OpenAI was trying to make better text-to-image generation. The NYU team was trying to improve the autoencoder in DiT. LeCun's lab was trying to build a world model for robotic planning. They all ended up in the same place: the right substrate for prediction is a semantic representation space, not pixels and not reconstruction-trained VAE latents.

The remaining differences are real. JEPA's encoder was trained for temporal prediction and DINO's for recognition. JEPA outputs a point estimate and diffusion outputs a distribution. JEPA predicts in one forward pass and diffusion takes a few denoising steps. Of these, the encoder difference is the one I'd bet on mattering most in practice. Whether it shows up in benchmark numbers is still an open question: nobody has run a clean head-to-head using the same planning task with V-JEPA 2's encoder versus a frozen DINO encoder in an RAE setup. But the theoretical case is solid enough that it shouldn't be dismissed.

Which brings it back to that hire. The most interesting experiment you could run right now is to take V-JEPA 2's encoder, which is specifically shaped for temporal prediction, and use it as the frozen backbone in an RAE-style setup, running flow matching on top. That would give you JEPA's temporally shaped space with a distributional output. Given who's now in the same room at AMI Labs, I wouldn't be surprised if someone is already running it.