– Top AI researchers like Fei-Fei Li and Yann LeCun are thinking about AI beyond LLMs.
– At World Labs, Li is focused on building world models. LeCun is building them at his new startup.
– World models mimic the mental constructs that humans create in their minds.
As OpenAI, Anthropic, and Big Tech invest billions in developing state-of-the-art large language models, a small group of elite AI researchers is working on what they say is the next big thing.
Computer scientists like Fei-Fei Li, the Stanford professor famous for inventing ImageNet, and Yann LeCun, Meta’s outgoing chief AI scientist, are building what they call “world models.”
Unlike large language models, which determine outputs based on statistical relationships between words and phrases, world models anticipate outcomes by mimicking the mental constructs that humans make of the world around them.
“Humans,” Li said on an episode of Andreessen Horowitz’s A16z podcast in June, “not only do we survive, live, and work, but we build civilization beyond language.”
Put simply, world models are AI systems that anticipate what will happen next, much as humans use intuition built from experience to predict the consequences of their actions. Think of a child who, with no language skills, learns that if they push a toy car, it will roll.
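To make the idea concrete, here is a minimal sketch of a world model in its simplest form: a function, learned from experience, that maps a current state and an action to a predicted next state. This is not code from World Labs, AMI Labs, or any company mentioned in this story; the toy-car dynamics, the linear model, and every name below are illustrative assumptions.

```python
# Illustrative only: a toy "world model" as a learned transition function
# f(state, action) -> next_state, trained on rollouts of a simulated toy car.
import numpy as np

rng = np.random.default_rng(0)

def true_dynamics(pos, vel, push):
    # Hypothetical ground truth: a push adds velocity; friction slows the car.
    vel = 0.9 * (vel + push)
    pos = pos + vel
    return pos, vel

# Collect (state, action, next_state) transitions from random interaction.
X, Y = [], []
pos, vel = 0.0, 0.0
for _ in range(1000):
    push = rng.uniform(-1.0, 1.0)
    next_pos, next_vel = true_dynamics(pos, vel, push)
    X.append([pos, vel, push])
    Y.append([next_pos, next_vel])
    pos, vel = next_pos, next_vel

X, Y = np.array(X), np.array(Y)

# Fit a linear transition model via least squares -- the system's "mental model".
W, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], Y, rcond=None)

def predict(state, action):
    # Anticipate the next state without touching the real environment.
    return np.r_[state, action, 1.0] @ W

# The model now anticipates the outcome of an action: push the car, it rolls.
print(predict(np.array([0.0, 0.0]), 0.5))  # roughly [0.45, 0.45]
```

Real world models replace the handful of numbers here with learned representations of images, video, or 3D scenes, and the linear fit with deep networks, but the state-plus-action-to-next-state structure is the same basic idea.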
In his 1971 paper “Counterintuitive Behavior of Social Systems,” computer scientist and MIT professor Jay Wright Forrester explained why mental models are crucial to human behavior:
Each of us uses models constantly. Every person in private life and in business instinctively uses models for decision making. The mental images in one’s head about one’s surroundings are models.
One’s head does not contain real families, businesses, cities, governments, or countries. One uses selected concepts and relationships to represent real systems. A mental image is a model.
All decisions are taken on the basis of models. All laws are passed on the basis of models. All executive actions are taken on the basis of models. The question is not to use or ignore models. The question is only a choice among alternative models.
If AI is to meet or surpass human intelligence, then the researchers behind it believe it should be able to make mental models, too.
World Labs
Li has been working on this through World Labs, which she cofounded in 2024 with an initial backing of $230 million from venture firms like Andreessen Horowitz, New Enterprise Associates, and Radical Ventures. “We aim to lift AI models from the 2D plane of pixels to full 3D worlds — both virtual and real — endowing them with spatial intelligence as rich as our own,” World Labs says on its website.
Li said on the “No Priors” podcast in June that spatial intelligence is “the ability to understand, reason, interact, and generate 3D worlds,” given that the world is fundamentally three-dimensional.
Li said she sees applications for world models in creative fields, robotics, or any area that warrants infinite universes.
A major challenge in building world models is the scarcity of suitable data.
In contrast to language, which humans have refined and documented over centuries, spatial intelligence is less developed.
“If I ask you to close your eyes right now and draw out or build a 3D model of the environment around you, it’s not that easy,” she said on the “No Priors” podcast. “We don’t have that much capability to generate extremely complicated models till we get trained.”
To gather the data necessary for these models, “we require more and more sophisticated data engineering, data acquisition, data processing, and data synthesis,” she said.
That makes the challenge of building a believable world even greater.
Advanced Machine Intelligence (AMI Labs)
LeCun, who is leaving his post as the chief AI scientist at Meta at the end of the year to launch his own startup called Advanced Machine Intelligence, has long been fixated on world models, which he says are more competent than large language models because they have common sense, the capacity to reason and plan, and persistent memory.
In a November LinkedIn post announcing his new venture, LeCun said the company’s goal is to “bring about the next big revolution in AI: systems that understand the physical world, have persistent memory, can reason, and can plan complex action sequences.”
On December 19, LeCun said on LinkedIn that he was recruiting Alex LeBrun, the cofounder and CEO of Nabla, an AI assistant for clinicians, as CEO of AMI Labs.
In a Nabla press release announcing the transition, LeBrun said, “Healthcare AI is entering a new era, one where reliability, determinism, and simulation matter as much as linguistic intelligence.”
“This partnership builds on that shared vision and gives Nabla privileged access to world model technology that will complement today’s LLMs and help unlock the safe, autonomous systems clinicians need,” he added.
Prior to the launch of AMI Labs, LeCun and a small team of researchers were working on a similar project at Meta, using video data to train models and run simulations that abstract the videos at different levels.
“The basic idea is that you don’t predict at the pixel level. You train a system to run an abstract representation of the video so that you can make predictions in that abstract representation, and hopefully this representation will eliminate all the details that cannot be predicted,” he said at the AI Action Summit in Paris earlier this year.
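As a rough illustration of what “predicting in an abstract representation” can look like, here is a minimal PyTorch sketch: frames are encoded into latent vectors, and a predictor is trained to match the next frame’s latent rather than its pixels. This is an assumption-laden toy, not Meta’s or AMI Labs’ architecture; the encoder, predictor, layer sizes, and random stand-in frames are all invented for illustration.

```python
# Illustrative sketch of predicting in an abstract (latent) space rather than
# at the pixel level, loosely in the spirit of the description above.
# Not an actual research architecture; all modules and sizes are assumptions.
import torch
import torch.nn as nn

LATENT = 64

encoder = nn.Sequential(          # maps a frame to an abstract representation
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
    nn.Linear(256, LATENT),
)
predictor = nn.Sequential(        # predicts the next latent from the current one
    nn.Linear(LATENT, 128), nn.ReLU(),
    nn.Linear(128, LATENT),
)

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3
)

# Random stand-ins for consecutive video frames: (batch, channels, height, width).
frames_t = torch.rand(8, 3, 32, 32)
frames_t1 = torch.rand(8, 3, 32, 32)

for _ in range(10):
    z_t = encoder(frames_t)                  # abstract representation of frame t
    with torch.no_grad():
        z_t1 = encoder(frames_t1)            # target representation of frame t+1
    loss = ((predictor(z_t) - z_t1) ** 2).mean()  # loss in latent space, not pixels
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice, systems built along these lines need extra machinery that this sketch omits, such as measures to keep the encoder from collapsing to trivial representations.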
Moonvalley
Founded by former DeepMind researchers, Moonvalley is a generative AI video startup that’s quietly developing its own world models. In March, it unveiled Marey, its first video-generation model.
“We’re thinking about world models and visual multimodal intelligence. We want to move beyond purely visual systems into something broader — models that understand not just what they see, but how the world works,” Moonvalley’s chief scientific officer, Mateusz Malinowski, told Business Insider.
Applications for world models include humanoid robotics and real-world planning, he said. The company says it is already training robots on its world models.
In a follow-up email, Malinowski explained the differences between these newer world models and existing vision models, which are used for tasks from facial recognition to object tracking.
“I’d start with saying that vision model is a very broad term,” Malinowski wrote. “If we’re referring to text-to-video generation models, from my vantage point, these models can be seen as the first steps for world models. World models emphasize world simulation, adherence to physical reality, long-term consistency of the environment, and action-conditioned generation.”
That said, Malinowski noted key differences between the world models being built by Moonvalley and those under development at World Labs.
“I believe that long-term goals are the same, but we are approaching the problem of world modelling differently,” he wrote. “We focus on using video models as first-class citizens, where spatial intelligence is more implicit. In the short term, our approach seems more suited for filmmaking and robotics due to motion and soft bodies modelling.”