Integrating embodied robots into a smart city's networked systems can significantly enhance the city's operational efficiency. Connected to the city's network, these robots can receive and transmit data in real time, improving human–robot communication. Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and action. However, most research on language-conditioned policies is limited to specific datasets and does not generalize across environments. In this study, we propose a novel imitation learning framework tailored for language-conditioned robotic tasks. The framework includes specialized encoders designed for various benchmarks and employs two distinct policy models: a Transformer and a Diffusion model. We rigorously evaluate the framework in three different robotic environments. Our findings indicate that it consistently delivers superior performance across multiple domains. Notably, the Transformer model is particularly effective on tasks with long trajectories, whereas the Diffusion model is better at generating trajectories from limited training data. Our approach shows strong generalization across a range of tasks and achieves significantly higher task-completion success rates.
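To make the notion of a language-conditioned policy concrete, the following is a minimal sketch of the general pattern the abstract describes: an instruction embedding is fused with an observation embedding, and the fused representation is mapped to an action. All dimensions, class names, and the linear policy head are illustrative assumptions, not the paper's actual Transformer or Diffusion architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class LanguageConditionedPolicy:
    """Toy stand-in for a language-conditioned policy.

    Real systems would use a pretrained language encoder and a
    Transformer or Diffusion policy head; here a single linear map
    over the fused embedding keeps the pattern visible.
    """

    def __init__(self, lang_dim: int = 8, obs_dim: int = 6, act_dim: int = 4):
        # Randomly initialized weights; imitation learning would fit
        # these to expert (instruction, observation, action) triples.
        self.W = rng.standard_normal((lang_dim + obs_dim, act_dim)) * 0.1

    def act(self, lang_emb: np.ndarray, obs_emb: np.ndarray) -> np.ndarray:
        # Early fusion: concatenate instruction and observation features.
        fused = np.concatenate([lang_emb, obs_emb])
        # Linear head stands in for the policy model's action decoder.
        return fused @ self.W

policy = LanguageConditionedPolicy()
action = policy.act(rng.standard_normal(8), rng.standard_normal(6))
print(action.shape)  # (4,)
```

The same fusion-then-decode pattern underlies both policy variants compared in the study; they differ in how the action decoder is realized, not in the conditioning interface.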