Tommoro Robotics Introduces CLIP-RT at RSS 2025: A Step Toward Natural Language-Driven Robot Learning
Published June 2, 2025
At the 2025 Robotics: Science and Systems (RSS) conference, Tommoro Robotics presented CLIP-RT, a new vision-language-action (VLA) foundation model developed in collaboration with Seoul National University. The model enables robots to learn manipulation skills directly from natural language instructions such as “move the cup to the shelf,” without requiring teleoperation or specialized training setups.
According to the research team, CLIP-RT combines vision, language, and action through a contrastive imitation learning framework, in which the system learns to match visual context and verbal commands to appropriate robot actions. Despite its relatively compact size of roughly one billion parameters, CLIP-RT outperforms much larger models such as Stanford’s OpenVLA (seven billion parameters) by roughly 24 percentage points in average task success.
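To make the contrastive idea concrete, the sketch below scores a set of candidate actions, written in plain language, against the current camera image and instruction using an off-the-shelf CLIP checkpoint, then picks the best match. This is only an illustration of the retrieval-style scoring behind contrastive approaches, not CLIP-RT itself: the checkpoint name, the sum-based fusion of image and instruction embeddings, and the example action phrases are all assumptions, and the actual model is trained on robot data rather than used zero-shot.

```python
# Illustrative sketch only: scoring natural-language action candidates with a
# stock CLIP checkpoint. It does NOT reproduce CLIP-RT's training or architecture.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical inputs: a camera frame and a verbal instruction.
image = Image.open("scene.jpg")               # current camera observation
instruction = "move the cup to the shelf"     # natural-language command

# Candidate low-level actions expressed in plain language (assumed action set).
actions = [
    "move the gripper toward the cup",
    "close the gripper around the cup",
    "lift the cup upward",
    "move the cup over the shelf",
    "open the gripper to release the cup",
]

with torch.no_grad():
    # Encode the image and all text (instruction + candidate actions).
    img_inputs = processor(images=image, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)

    txt_inputs = processor(text=[instruction] + actions,
                           return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)
    instr_emb, action_embs = txt_emb[:1], txt_emb[1:]

    # Fuse image and instruction into one context embedding (simple normalized
    # sum here -- an assumption for illustration, not CLIP-RT's actual fusion).
    context = (img_emb / img_emb.norm(dim=-1, keepdim=True)
               + instr_emb / instr_emb.norm(dim=-1, keepdim=True))
    context = context / context.norm(dim=-1, keepdim=True)
    action_embs = action_embs / action_embs.norm(dim=-1, keepdim=True)

    # Contrastive-style scoring: cosine similarity between the context and each
    # candidate action, then pick the highest-scoring action.
    scores = context @ action_embs.T
    best = actions[scores.argmax().item()]

print(f"selected action: {best}")
```

In a trained system of this kind, the same similarity scores would be optimized with a contrastive loss so that demonstrated (image, instruction, action) triples score higher than mismatched ones; the sketch above only shows the inference-time selection step.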
In benchmark tests on the LIBERO robotic manipulation suite, CLIP-RT achieved a 92.8% average task success rate and ran at real-time control rates of up to 163 Hz. The system can also generalize to new tasks from only a few demonstrations, making it practical for real-world deployment.
The project represents a joint effort between Tommoro Robotics and the AI Research Institute at Seoul National University, led by Professor Byoung-Tak Zhang. By bridging natural language and robotic control, CLIP-RT aims to make teaching robots as intuitive as giving verbal instructions, marking a significant step toward more accessible and adaptive robot learning.

