Abstract

Deep imitation learning enables the learning of complex visuomotor skills from raw pixel inputs. However, this approach tends to overfit to the training images: the neural network is easily distracted by task-irrelevant objects. In this letter, we use the human gaze, measured by a head-mounted eye-tracking device, to discard task-irrelevant visual distractions. We propose a mixture density network-based behavior cloning method that learns to imitate the human gaze. The model predicts gaze positions from raw pixel images and crops the images around the predicted gaze; only these cropped images are used to compute the output action. This cropping procedure can remove visual distractions because the gaze is rarely fixated on task-irrelevant objects. The resulting robustness improves the manipulation performance of robots in scenes that contain task-irrelevant objects. We evaluated our model on four manipulation tasks designed to test its robustness to irrelevant objects. The results indicate that the proposed model can predict the locations of task-relevant objects from gaze positions, is robust to task-irrelevant objects, and exhibits impressive manipulation performance, especially in multi-object handling.
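The gaze-conditioned cropping step described in the abstract can be expressed compactly. Below is a minimal sketch of cutting a fixed-size window around a predicted gaze position; the function name, the crop size, and the boundary clamping are illustrative assumptions, not the letter's exact implementation:

```python
import numpy as np

def crop_around_gaze(image, gaze_xy, crop_size=64):
    """Crop a square patch centered on the predicted gaze position.

    image     : H x W x C pixel array (the raw observation)
    gaze_xy   : (x, y) predicted gaze position in pixel coordinates
    crop_size : side length of the patch; 64 is an illustrative value
    """
    h, w = image.shape[:2]
    half = crop_size // 2
    # Clamp the crop center so the window stays inside the image bounds.
    cx = int(np.clip(gaze_xy[0], half, w - half))
    cy = int(np.clip(gaze_xy[1], half, h - half))
    return image[cy - half:cy + half, cx - half:cx + half]
```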

Highlights

  • Imitation learning involves learning a policy by observing expert demonstrations

  • We propose the use of eye tracking to improve imitation learning for robot manipulation tasks

  • A Mixture Density Network (MDN)-based architecture is proposed to learn visual attention and crop images around the predicted gaze, preventing the performance degradation caused by visual distractions (sketched below)
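As a rough illustration of the highlighted architecture, the following PyTorch sketch shows an MDN head that maps image features to a Gaussian mixture over 2-D gaze positions. The class name, feature dimension, number of mixture components, and the isotropic-covariance choice are assumptions made for illustration, not the paper's specification:

```python
import torch
import torch.nn as nn

class MDNGazeHead(nn.Module):
    """Maps a feature vector to a Gaussian mixture over 2-D gaze positions."""

    def __init__(self, feat_dim=256, n_components=8):
        super().__init__()
        self.n = n_components
        self.pi_layer = nn.Linear(feat_dim, n_components)      # mixture weights
        self.mu_layer = nn.Linear(feat_dim, n_components * 2)  # 2-D means
        self.sigma_layer = nn.Linear(feat_dim, n_components)   # log std devs

    def forward(self, feat):
        pi = torch.softmax(self.pi_layer(feat), dim=-1)        # (B, K)
        mu = self.mu_layer(feat).view(-1, self.n, 2)           # (B, K, 2)
        sigma = torch.exp(self.sigma_layer(feat))              # (B, K), positive
        return pi, mu, sigma
```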


Summary

INTRODUCTION

Imitation learning involves learning a policy by observing expert demonstrations. One application of imitation learning is in robotics (e.g., [1]–[4]), because this method offers the potential to learn complex policies. However, changes in the background (i.e., the appearance of task-irrelevant objects) alter the network's policy output, because the mapping from visual features to the output action relies on fully connected layers. In our method, the acquired gaze positions, together with state-action demonstration pairs, are used to learn manipulation tasks. Because the method discards out-of-gaze objects, the policy is robust to visual distractions such as the appearance of unseen, new objects. The main contributions of this paper are as follows: 1) To the best of our knowledge, this research is the first to use the human gaze to improve imitation learning performance for robot manipulation tasks. 2) We propose using the MDN to predict the human gaze. 3) We empirically show that gaze prediction makes the learned policy more robust to visual distractions and improves multi-object manipulation performance.
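To make contribution 2) concrete, a standard way to train such a gaze predictor is to minimize the negative log-likelihood of the measured human gaze under the predicted mixture. The sketch below matches the MDN head sketched earlier and assumes isotropic 2-D Gaussian components; the exact loss used in the letter may differ:

```python
import math
import torch

def mdn_gaze_nll(pi, mu, sigma, gaze):
    """Negative log-likelihood of measured gaze under an isotropic
    2-D Gaussian mixture (sketch; matches the MDN head above).

    pi    : (B, K) mixture weights (already softmaxed)
    mu    : (B, K, 2) component means
    sigma : (B, K) isotropic standard deviations
    gaze  : (B, 2) measured human gaze positions
    """
    sq_dist = ((gaze.unsqueeze(1) - mu) ** 2).sum(dim=-1)  # (B, K)
    log_comp = (-sq_dist / (2 * sigma ** 2)
                - 2 * torch.log(sigma)
                - math.log(2 * math.pi))                   # log N(gaze | mu, sigma^2 I)
    # log sum_k pi_k * N_k, computed stably via logsumexp
    return -torch.logsumexp(torch.log(pi) + log_comp, dim=1).mean()
```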

RELATED WORK
Hardware
Data Processing
BEHAVIOR CLONING WITH GAZE PREDICTION
Mixture Density Network
Model Architecture
Loss Function
Experimental Setup
Assessment of Performance in Terms of Predicting Gaze
Evaluating Performance on Manipulation Tasks
DISCUSSION
Task specification
Model specifications
Evaluation procedure
Findings
Gaze prediction evaluation with various metrics

