LLMs have ingested the entire internet. They know that "get me a snack" means finding food and bringing it. SayCan (Google, 2022) was the first to harness this knowledge for robotics.
But there was a problem: LLMs hallucinate. They'll confidently tell a robot to "pick up the sponge" when there's no sponge in sight. SayCan's fix was simple: before acting, score each LLM suggestion by whether the robot can actually carry it out in the current scene. This grounding took success rates from 38% to 84%.
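The grounding idea reduces to multiplying two scores per candidate skill: how useful the LLM thinks it is, and how feasible an affordance model thinks it is right now. A minimal toy sketch (the scores and skill names here are invented; real SayCan uses LLM token likelihoods and learned value functions):

```python
# Toy sketch of SayCan-style grounding. Illustrative only: scores are made up,
# not produced by an actual LLM or affordance model.

def saycan_select(skills, llm_score, affordance):
    # Combined score: P(skill is useful | instruction) * P(skill is feasible | scene)
    return max(skills, key=lambda s: llm_score[s] * affordance[s])

# The LLM prefers the sponge, but the affordance model knows there is no
# sponge in view, so the feasible option wins.
llm_score  = {"pick up the sponge": 0.9, "pick up the towel": 0.1}
affordance = {"pick up the sponge": 0.05, "pick up the towel": 0.8}
print(saycan_select(list(llm_score), llm_score, affordance))  # -> pick up the towel
```

The multiplication is the whole trick: a hallucinated skill scores high on usefulness but near zero on feasibility, so it never gets selected.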
Vision-language models (VLMs) understand images and text together. They are trained on billions of image-caption pairs from the internet. RT-2 (Google DeepMind, 2023) was the first to fine-tune one directly for robot control.
The payoff was immediate: on objects it had never seen during training, RT-2 succeeded 62% of the time versus 32% for its predecessor, RT-1. It could recognize novel objects because its VLM backbone had seen millions of internet images. OpenVLA (2024) then open-sourced the whole recipe, sparking the explosion of vision-language-action models (VLAs) that followed: π0, GR00T, Helix, and dozens more.
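The mechanism that makes this work is representing robot actions as tokens, so the VLM can emit them the same way it emits words. A hedged sketch of the discretization step (256 bins follows the RT-2 paper's description; the action range and dimensions here are assumptions for illustration):

```python
import numpy as np

# Sketch of RT-2-style action tokenization: each continuous action dimension
# is discretized into one of 256 bins, and the bin ids become output tokens.
N_BINS = 256

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map continuous action values to integer token ids in [0, N_BINS - 1]."""
    clipped = np.clip(action, low, high)
    bins = ((clipped - low) / (high - low) * (N_BINS - 1)).round().astype(int)
    return bins.tolist()

def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert: map token ids back to approximate continuous values."""
    return [low + t / (N_BINS - 1) * (high - low) for t in tokens]

# A 3-dim toy action round-trips with small quantization error.
tokens = action_to_tokens(np.array([0.0, 0.5, -1.0]))
recovered = tokens_to_action(tokens)
```

Because actions are just more tokens, fine-tuning needs no new architecture: the same next-token prediction that writes captions now writes motor commands.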
Video models trained on internet footage have watched billions of examples of objects falling, hands grasping, and liquids pouring. They already know how the world moves. GR-1 (ByteDance, 2023) laid the groundwork for video-based robot control, but mimic-video (2025) was the first to fully exploit it with a pretrained video model backbone.
The early results are striking: 10x better sample efficiency than VLAs. Cosmos-Policy reports state-of-the-art performance on policy benchmarks. 1XWM trained its entire action model on just 400 hours of robot data, versus the 10,000+ hours used to train leading VLAs. The physics is already learned; all that remains is teaching the robot to act on it.
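The recipe behind that sample efficiency is to freeze the pretrained backbone and train only a small action head on the limited robot data. A toy numpy sketch (every name, shape, and number here is invented for illustration; real systems use deep video transformers, not a single frozen matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a video backbone pretrained on internet footage.
# It stays frozen: the "physics knowledge" lives in these weights.
W_backbone = rng.normal(size=(64, 32))

def backbone(frames):                      # frames: (batch, 64) flattened features
    return np.tanh(frames @ W_backbone)    # frozen representation

# Only this small head is trained on the scarce robot dataset.
W_head = np.zeros((32, 7))                 # e.g. a 7-DoF action output

def policy(frames):
    return backbone(frames) @ W_head

# One gradient-descent step on the head alone, on a tiny toy batch.
frames  = rng.normal(size=(16, 64))
actions = rng.normal(size=(16, 7))
feats = backbone(frames)
W_head += 0.1 * feats.T @ (actions - feats @ W_head) / len(frames)
```

Because the trainable surface is tiny relative to the backbone, far less robot data is needed to fit it, which is the claimed source of the 400-hour result.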