Abstract
Deep imitation learning enables the learning of complex visuomotor skills from raw pixel inputs. However, this approach suffers from overfitting to the training images: the neural network can easily be distracted by task-irrelevant objects. In this letter, we use the human gaze, measured by a head-mounted eye-tracking device, to discard task-irrelevant visual distractions. We propose a mixture density network (MDN)-based behavior cloning method that learns to imitate the human gaze. The model predicts gaze positions from raw pixel images and crops images around the predicted gaze. Only these cropped images are used to compute the output action. This cropping procedure can remove visual distractions because the gaze is rarely fixated on task-irrelevant objects, and the resulting robustness can improve the manipulation performance of robots in scenarios where task-irrelevant objects are present. We evaluated our model on four manipulation tasks designed to test its robustness to irrelevant objects. The results indicate that the proposed model can predict the locations of task-relevant objects from gaze positions, is robust to task-irrelevant objects, and exhibits strong manipulation performance, especially in multi-object handling.
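To make the gaze-prediction step concrete, the following is a minimal sketch of an MDN gaze head in PyTorch. The class name GazeMDN, the backbone layout, the assumed 3x64x64 input resolution, and the number of mixture components are illustrative assumptions rather than the authors' exact architecture; the head outputs mixture weights, 2-D means, and standard deviations for the gaze position and is trained with the mixture negative log-likelihood of the measured human gaze.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeMDN(nn.Module):
    def __init__(self, n_components: int = 5):
        super().__init__()
        # Small convolutional backbone over the raw pixel image (assumed 3x64x64).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * 13 * 13  # flattened feature size for a 64x64 input
        self.n = n_components
        # Mixture weights, 2-D means (gaze x, y), and per-dimension std devs.
        self.pi = nn.Linear(feat_dim, self.n)
        self.mu = nn.Linear(feat_dim, self.n * 2)
        self.log_sigma = nn.Linear(feat_dim, self.n * 2)

    def forward(self, img):
        h = self.backbone(img)
        log_pi = F.log_softmax(self.pi(h), dim=-1)            # (B, n)
        mu = self.mu(h).view(-1, self.n, 2)                   # (B, n, 2)
        sigma = self.log_sigma(h).view(-1, self.n, 2).exp()   # (B, n, 2)
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, gaze):
    # Negative log-likelihood of the measured gaze (B, 2) under the mixture.
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(gaze.unsqueeze(1)).sum(-1)       # (B, n)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()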
Highlights
Imitation learning involves learning a policy by observing expert demonstrations
We propose using eye tracking to improve imitation learning for robot manipulation tasks
A Mixture Density Network (MDN)-based architecture is proposed that learns visual attention and crops images around the predicted gaze, preventing performance degradation caused by visual distractions
Summary
Imitation learning involves learning a policy by observing expert demonstrations. One application of imitation learning is robotics (e.g., [1]–[4]), because this method offers the potential to learn complex policies. However, changes in the background (i.e., the appearance of task-irrelevant objects) alter the network's policy output, because the mapping from visual features to the output action relies on fully connected layers. In our method, the acquired gaze positions, together with state-action demonstration pairs, are used to learn manipulation tasks. Because the method discards out-of-gaze objects when computing the policy output, the policy is robust to visual distractions such as the appearance of unseen and new objects. The main contributions of this paper are as follows: 1) To the best of our knowledge, this research is the first to use the human gaze to improve imitation learning performance for robot manipulation tasks. 2) We propose using the MDN to predict the human gaze. 3) We empirically show that gaze prediction makes the learned policy more robust to visual distractions and improves multi-object manipulation performance.
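As a rough illustration of how the predicted gaze could gate the policy input, the sketch below crops a fixed-size window around the predicted gaze and computes the action from the crop only, reusing the GazeMDN sketch above. The crop size, the PolicyHead name, the action dimension, and the choice of taking the mean of the most likely mixture component as the gaze are assumptions for illustration, not the authors' exact procedure.

import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    # Maps a gaze-centered crop to a robot action (e.g., joint velocities).
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 15 * 15, action_dim),  # sized for a 32x32 crop
        )

    def forward(self, crop_img):
        return self.net(crop_img)

def crop_around_gaze(img, gaze_xy, crop: int = 32):
    # Crop a (crop x crop) window centered on the predicted gaze (pixel coords).
    _, _, H, W = img.shape
    half = crop // 2
    crops = []
    for b in range(img.size(0)):
        x = int(gaze_xy[b, 0].clamp(half, W - half))
        y = int(gaze_xy[b, 1].clamp(half, H - half))
        crops.append(img[b, :, y - half:y + half, x - half:x + half])
    return torch.stack(crops)

# Usage: take the mean of the most likely mixture component as the gaze,
# crop around it, and compute the action from the cropped image only.
# log_pi, mu, sigma = gaze_mdn(img)
# best = log_pi.argmax(dim=-1)
# gaze = mu[torch.arange(mu.size(0)), best]
# action = policy(crop_around_gaze(img, gaze))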