Humans typically make decisions based on past experiences and observations, whereas in robotic manipulation a robot's action prediction often relies solely on the current observation, which can cause the robot to overlook environmental changes or fail when the current observation is suboptimal. To address this pivotal challenge, and inspired by human cognitive processes, we propose a method that integrates historical learning and multi-view attention to improve robotic manipulation. Built on a spatio-temporal attention mechanism, our method not only combines observations from the current and past steps but also incorporates historical actions, allowing the robot to better perceive changes in its own behaviours and their impact on the environment. We further employ a mutual-information-based multi-view attention module that automatically focuses on the most valuable viewpoints, bringing more effective information into decision-making. In addition, inspired by the human visual system, which processes both global context and local texture detail, we devise a mechanism that merges semantic and texture features, helping the robot understand the task and enhancing its ability to handle fine-grained manipulation. Extensive experiments on RLBench and in real-world scenarios demonstrate that our method handles a variety of tasks effectively and exhibits notable robustness and adaptability.
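To make the history-conditioned attention idea concrete, below is a minimal PyTorch sketch of a temporal encoder that fuses current and past observation features with past action embeddings via self-attention over time. This is an illustrative assumption, not the paper's exact architecture: the module name, feature dimensions, fusion by addition, and the learned positional embedding are all hypothetical choices.

```python
import torch
import torch.nn as nn


class SpatioTemporalHistoryEncoder(nn.Module):
    """Hypothetical sketch: attends over a window of observation and
    action features so the current-step prediction can account for
    environmental changes caused by the robot's own past behaviour."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4, max_steps: int = 8):
        super().__init__()
        # Learned temporal position embedding (assumed design choice).
        self.pos_emb = nn.Parameter(torch.zeros(1, max_steps, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, obs_feats: torch.Tensor, act_feats: torch.Tensor) -> torch.Tensor:
        # obs_feats, act_feats: [batch, T, feat_dim]; step T-1 is the current step.
        # Fuse observations with historical actions by addition (assumption).
        x = obs_feats + act_feats + self.pos_emb[:, : obs_feats.shape[1]]
        out, _ = self.attn(x, x, x)   # self-attention across the time axis
        x = self.norm(x + out)        # residual connection + layer norm
        return x[:, -1]               # fused feature for the current step


# Usage: a history window of 8 steps with 256-d observation/action features.
enc = SpatioTemporalHistoryEncoder()
obs = torch.randn(2, 8, 256)
act = torch.randn(2, 8, 256)
print(enc(obs, act).shape)  # torch.Size([2, 256])
```

In this sketch the attended history feature would feed a downstream action head; the paper's multi-view attention and semantic/texture fusion modules would sit upstream, producing the per-step observation features consumed here.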