LLMs have ingested the entire internet. They know that "get me a snack" means finding food and bringing it. SayCan (Google, 2022) was the first to harness this knowledge for robotics.
But there was a problem: LLMs hallucinate. They'll confidently tell a robot to "pick up the sponge" when there's no sponge in sight. SayCan's fix was simple: before acting, score each LLM suggestion by whether the robot can actually carry it out in the current scene. This grounding took success rates from 38% to 84%.
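The grounding idea reduces to multiplying two scores per candidate skill: how useful the LLM thinks it is, and how feasible an affordance model thinks it is right now. A minimal toy sketch (the scores and skill names here are invented; real SayCan uses LLM token likelihoods and learned value functions):

```python
# Toy sketch of SayCan-style grounding. Illustrative only: scores are made up,
# not produced by an actual LLM or affordance model.

def saycan_select(skills, llm_score, affordance):
    # Combined score: P(skill is useful | instruction) * P(skill is feasible | scene)
    return max(skills, key=lambda s: llm_score[s] * affordance[s])

# The LLM prefers the sponge, but the affordance model knows there is no
# sponge in view, so the feasible option wins.
llm_score  = {"pick up the sponge": 0.9, "pick up the towel": 0.1}
affordance = {"pick up the sponge": 0.05, "pick up the towel": 0.8}
print(saycan_select(list(llm_score), llm_score, affordance))  # -> pick up the towel
```

The multiplication is the whole trick: a hallucinated skill scores high on usefulness but near zero on feasibility, so it never gets selected.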
Vision-language models (VLMs) understand images and text together. They are trained on billions of image-caption pairs from the internet. RT-2 (Google DeepMind, 2023) was the first to fine-tune one directly for robot control.
The payoff was immediate: on objects it had never seen during training, RT-2 succeeded 62% of the time versus 32% for its predecessor, RT-1. It could recognize novel objects because its VLM backbone had seen millions of internet images. OpenVLA (2024) then open-sourced the whole recipe, sparking the explosion of vision-language-action models (VLAs) that followed: π0, GR00T, Helix, and dozens more.
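The mechanism that makes this work is representing robot actions as tokens, so the VLM can emit them the same way it emits words. A hedged sketch of the discretization step (256 bins follows the RT-2 paper's description; the action range and dimensions here are assumptions for illustration):

```python
import numpy as np

# Sketch of RT-2-style action tokenization: each continuous action dimension
# is discretized into one of 256 bins, and the bin ids become output tokens.
N_BINS = 256

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map continuous action values to integer token ids in [0, N_BINS - 1]."""
    clipped = np.clip(action, low, high)
    bins = ((clipped - low) / (high - low) * (N_BINS - 1)).round().astype(int)
    return bins.tolist()

def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert: map token ids back to approximate continuous values."""
    return [low + t / (N_BINS - 1) * (high - low) for t in tokens]

# A 3-dim toy action round-trips with small quantization error.
tokens = action_to_tokens(np.array([0.0, 0.5, -1.0]))
recovered = tokens_to_action(tokens)
```

Because actions are just more tokens, fine-tuning needs no new architecture: the same next-token prediction that writes captions now writes motor commands.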
Video models trained on internet footage have watched billions of examples of objects falling, hands grasping, and liquids pouring. They already know how the world moves. GR-1 (ByteDance, 2023) laid the groundwork for video-based robot control, but mimic-video (2025) was the first to fully exploit it with a pretrained video model backbone.
The early results are striking: 10x better sample efficiency than VLAs. Cosmos-Policy reports state-of-the-art performance on policy benchmarks. 1XWM trained its entire action model on just 400 hours of robot data, versus the 10,000+ hours used to train leading VLAs. The physics is already learned; all that remains is teaching the robot to act on it.
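The recipe behind that sample efficiency is to freeze the pretrained backbone and train only a small action head on the limited robot data. A toy numpy sketch (every name, shape, and number here is invented for illustration; real systems use deep video transformers, not a single frozen matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a video backbone pretrained on internet footage.
# It stays frozen: the "physics knowledge" lives in these weights.
W_backbone = rng.normal(size=(64, 32))

def backbone(frames):                      # frames: (batch, 64) flattened features
    return np.tanh(frames @ W_backbone)    # frozen representation

# Only this small head is trained on the scarce robot dataset.
W_head = np.zeros((32, 7))                 # e.g. a 7-DoF action output

def policy(frames):
    return backbone(frames) @ W_head

# One gradient-descent step on the head alone, on a tiny toy batch.
frames  = rng.normal(size=(16, 64))
actions = rng.normal(size=(16, 7))
feats = backbone(frames)
W_head += 0.1 * feats.T @ (actions - feats @ W_head) / len(frames)
```

Because the trainable surface is tiny relative to the backbone, far less robot data is needed to fit it, which is the claimed source of the 400-hour result.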