Speakers
Keynote Speakers
Cordelia Schmid
Title
Multimodal video understanding & vision-language guided robotics
Abstract
In this talk, we first present recent progress on large-scale learning of multimodal video representations. We introduce Vid2Seq, a model for dense video captioning that takes video and speech as input and predicts temporal boundaries and textual descriptions simultaneously. We then present an approach for video question answering that relies on multimodal reasoning. We show that our approach achieves state-of-the-art results on visual question answering.
In the second part of the talk we introduce recent work on vision-guided navigation and robot manipulation given language instructions. This work builds on and extends vision-language transformers by integrating action history and predicting actions. The History Aware Multimodal Transformer outperforms the state of the art on several vision-and-language navigation benchmarks. Further improvements are achieved by integrating map information into the transformer architecture. We show object goal navigation in the real world on the Tiago robot. Next, we demonstrate that such a transformer-based approach can also be used for manipulation and highlight the importance of 3D visual representations. Our approach achieves excellent real-world performance on a UR5 arm.
Short bio
Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate, also in Computer Science, from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at Inria, where she is a research director.
Dr. Schmid is a member of the German National Academy of Sciences Leopoldina and a fellow of the IEEE and the ELLIS Society. She was awarded the Longuet-Higgins Prize in 2006, 2014 and 2016, the Koenderink Prize in 2018 and the Helmholtz Prize in 2023, all for fundamental contributions in computer vision that have withstood the test of time. She received an ERC Advanced Grant in 2013, the Humboldt Research Award in 2015, the Inria & French Academy of Science Grand Prix in 2016, the Royal Society Milner Award in 2020 and the PAMI Distinguished Researcher Award in 2021. In 2023 she received the Körber European Science Prize and in 2024 the European Inventor Award in the research category. Dr. Schmid has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (2004--2012), Editor-in-Chief of IJCV (2013--2018), a program chair of IEEE CVPR 2005 and ECCV 2012, and a general chair of IEEE CVPR 2015, ECCV 2020 and ICCV 2023. Since 2018 she has held a joint appointment with Google Research.
Homepage link
https://cordeliaschmid.github.io/
Tomas Lozano-Perez
Title
Building Robots that Understand Their Worlds
Abstract
An enduring goal of AI and robotics has been to build a robot capable of robustly performing a wide variety of tasks in a wide variety of environments; not by sequentially being programmed (or taught) to perform one task in one environment at a time, but rather by intelligently choosing appropriate actions for whatever task and environment it is facing. In spite of tremendous progress in recent years, this goal remains a challenge.
In this talk I’ll describe work in our lab aimed at the goal of general-purpose robot manipulation by integrating the in-depth geometric reasoning and planning capabilities of task-and-motion planners with various forms of model learning. In particular, I’ll describe approaches to manipulating objects without prior shape models, to acquiring composable sensorimotor skills from few demonstrations, and to autonomously learning how to improve a robot’s performance with practice.
Short bio
Tomas Lozano-Perez is currently the School of Engineering Professor in Teaching Excellence at the Massachusetts Institute of Technology (MIT), USA, where he is a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). He has been Associate Director of the Artificial Intelligence Laboratory and Associate Head for Computer Science of MIT's Department of Electrical Engineering and Computer Science. He was a recipient of the 2021 IEEE Robotics and Automation Award, the 2011 IEEE Robotics Pioneer Award and a 1985 Presidential Young Investigator Award. He is a Fellow of the AAAI, ACM, and IEEE.
His research has been in robotics (configuration-space approach to motion planning), computer vision (interpretation-tree approach to object recognition), machine learning (multiple-instance learning), medical imaging (computer-assisted surgery) and computational chemistry (drug activity prediction). His current research is aimed at integrating machine learning with task, motion and decision-theoretic planning for robotic manipulation.
Homepage link
Jitendra Malik
Title
Robot Learning, with inspiration from child development
Abstract
For intelligent robots to become ubiquitous, we need to “solve” locomotion, navigation and manipulation at sufficient reliability in widely varying environments. Learning approaches have been responsible for most recent advances, but they are held up by the lack of “big data” at the scale available in language and vision. In my talk, I will showcase recent research results on all three tasks. In locomotion, following past work on quadrupeds, we now have demonstrations of humanoid walking in a variety of challenging environments. In navigation, we pursued the task of “Go to Any Thing”: a robot, on entering a newly rented Airbnb, should be able to find objects such as TV sets or potted plants. In manipulation, we studied dexterous dynamic tasks such as in-hand rotation and twisting off caps of bottles. RL in simulation and sim-to-real have been workhorse technologies for us, assisted by a few technical innovations. For dexterous manipulation, multimodal perception is key: vision, touch and proprioception. The ability to exploit visual imitation would go a long way toward solving the big data problem, and we have made major progress on the prerequisite steps of 4D reconstruction of human bodies, hands, and manipulable objects.
Short bio
Jitendra Malik is the Arthur J. Chick Professor of EECS at UC Berkeley, and (part-time) Research Scientist Director at FAIR, Meta Inc. His group has conducted research on many different topics in computer vision, computer graphics, machine learning and robotics, resulting in concepts such as anisotropic diffusion, high dynamic range imaging, normalized cuts, R-CNN and rapid motor adaptation. His publications have received eleven best paper awards, including six test-of-time awards: the Longuet-Higgins Prize for papers published at CVPR (three times) and the Helmholtz Prize for papers published at ICCV (three times). He has mentored more than 80 PhD students and postdoctoral fellows, many of whom have gone on to become leading researchers at places like MIT, Berkeley, CMU, Caltech, Cornell, UIUC, UPenn, Michigan, UT Austin, Google and Meta.
Jitendra received the 2016 ACM/AAAI Allen Newell Award, the 2018 IJCAI Award for Research Excellence in AI, and the 2019 IEEE Computer Society Computer Pioneer Award for his “leading role in developing Computer Vision into a thriving discipline through pioneering research, leadership, and mentorship”. He is a member of the US National Academy of Sciences and the National Academy of Engineering, and a Fellow of the American Academy of Arts and Sciences.
Homepage link
Early Career Keynote Speakers
Georgia Chalvatzaki
Title
On the Quest for Robotic Embodied Intelligence: The Role of Structure in Robot Learning
Abstract
Achieving robotic embodied intelligence requires robots to learn to balance perception and action seamlessly, just as humans, even from an early age, navigate and manipulate their environments through a continuous cycle of integrated perception, action, and learning. In this keynote, I will explore how structure can be integrated at different layers of robot learning algorithms to enable faster and safer learning while fostering generalized behaviors.
I will demonstrate how embedding and leveraging structure within representation learning, motion generation, decision-making, and exploration strategies in robot reinforcement learning leads to more efficient, safe, and versatile behaviors in complex robotic systems, allowing for effective coordination across multiple embodiments. This structured approach lays a foundation for future autonomous robot learning systems to efficiently adapt to, and integrate into, new environments. Just as structured learning enables humans to achieve natural intelligence, I argue that structured robot learning is essential for developing robotic embodied intelligence, ultimately guiding us toward smart and safe robotic assistance in our daily lives.
Short Bio
Georgia Chalvatzaki is a Full Professor of Interactive Robot Perception & Learning in the Computer Science Department of the Technical University of Darmstadt and at Hessian.AI. She is the recipient of the renowned Emmy Noether grant of the German Research Foundation (DFG) for her project iROSA (2021-2027), and was awarded an ERC Starting Grant in 2024 for her research project SIREN (to start in 2025).
She received her Ph.D. in December 2019 from the Intelligent Robotics and Automation Lab at the Electrical and Computer Engineering School of the National Technical University of Athens, Greece, with her thesis “Human-Centered Modeling for Assistive Robotics: Stochastic Estimation and Robot Learning in Decision-Making”. From October 2019 until February 2021, she was a postdoctoral researcher in the Intelligent Autonomous Systems group at TU Darmstadt. She started her independent Emmy Noether DFG-funded research group in March 2021. In February 2022, Georgia was promoted to Assistant Professor (W1), and after just one year she became a Full Professor (W3) at TU Darmstadt.
Homepage Link
https://pearl-lab.com/people/georgia-chalvatzaki/
Danfei Xu
Title
Robot Learning from Embodied Human Data
Abstract
The foundation of modern AI is scalable knowledge transfer from humans to machines. While Computer Vision and NLP can glean from exabytes of human-generated data on the Internet, Robot Learning still heavily relies on resource-intensive processes such as teleoperation. Can we capture how humans interact with the physical world as effortlessly as the Internet captures the virtual world? We propose that leveraging embodied human data is a crucial step toward this future. Just as the Internet evolved into an unintentional data repository for AI, we envision systems that effortlessly capture rich embodied experiences from human activities, without humans’ conscious participation. In this talk, I will present our work in hardware, systems, and algorithms for collecting and learning from embodied human data. I will conclude by sharing our vision of human-centric robot learning, where machines can better understand and interact with humans and human environments by taking a human perspective.
Short Bio
Danfei Xu is an Assistant Professor in the School of Interactive Computing at Georgia Institute of Technology, where he directs the Robot Learning and Reasoning Lab. He is also a researcher at NVIDIA AI. He earned his Ph.D. in Computer Science from Stanford University in 2021. His research focuses on machine learning methods for robotics, particularly in manipulation planning and imitation learning. His work has received Best Paper nominations at the Conference on Robot Learning (CoRL) and IEEE Robotics and Automation Letters (RA-L).
Homepage Link
https://faculty.cc.gatech.edu/~danfei/
Vikash Kumar
Title
Robotics – the future is anything but optimal!
Abstract
This talk will explore a unified framework for robotic intelligence through the lens of robust control. I’ll discuss how generalization in robotics can be achieved by identifying and leveraging invariant structures across diverse domains—compact spaces of invariances that are inherently robust to variation. I’ll outline how this perspective provides a unified lens bringing together multiple ongoing advancements in our field – foundation model building, cross-embodied learning, task-agnostic representations, Sim2real adaptations, etc.
To ground this perspective technically, I’ll introduce the “Foundation MDP”, a representational viewpoint on robotic decision-making.
To illustrate its universality, we’ll review recent developments, including representations for observations (RRL, R3M, MVP, VC-1, GenAug), goals (Zest), rewards (VIP, LIV), and actions (H2R, MyoDex, SAR), which align under this robust control viewpoint. We’ll also discuss how the shift to robust control has reduced the dependency on high-precision hardware, enabling strong performance even with affordable, lower-quality hardware platforms like ROBEL and ALOHA. Finally, I’ll present RoboAgent, a highly efficient universal agent that underscores the potential of this approach for developing scalable, real-world robotic agents.
Short Bio
Vikash Kumar is an Adjunct Professor at the Robotics Institute, CMU. He finished his Ph.D. at the University of Washington with Prof. Sergey Levine and Prof. Emo Todorov and his M.S. and B.S. from the Indian Institute of Technology (IIT), Kharagpur. His professional experience includes roles as Sr. Research Scientist at FAIR-MetaAI, and Research Scientist at Google-Brain and OpenAI.
Vikash’s research centers on understanding the fundamentals of embodied intelligence across biological, digital, and electromechanical systems. His research leverages data-driven techniques to realize artificial beings — both digital and physical — that are indistinguishable from humans in their appearance, spatial reasoning, and behavioral intelligence. His work has led to advancements such as human-level dexterity in anthropomorphic robotic hands as well as physiological digital twins, low-cost scalable systems capable of contact-rich behaviors, skilled multi-task multi-skill robotic agents, and more.
He is the lead creator of MyoSuite and RoboHive, and a founding member of the team behind the MuJoCo physics engine, now widely used in the fields of Robotics and Machine Learning. His work has been recognized with a best Master's thesis award, the best manipulation paper award at ICRA'16, the best paper award at ICRA'24, the best workshop paper award at ICRA'22, and a CIFAR AI Chair in 2020 (declined), and has been widely covered in media outlets such as The New York Times, Reuters, ACM, WIRED, MIT Technology Review, and IEEE Spectrum.
Homepage Link