Abstract

We propose an architecture for a system that will “watch and listen to” an instructional video of a human performing a task and translate the audio and video information into a task for a robot to perform. This enables robots to be trained from readily available instructional videos on the Internet rather than being programmed directly. We implemented an operational prototype based on the architecture and showed that it could “watch and listen to” two instructional videos on how to clean golf clubs and translate the audio and video information from each video into tasks for a robot to perform. The key contributions of this architecture are: integration of multiple modalities using trees and pruning with filters; task decomposition into macro-tasks composed of parameterized task-primitives and other macro-tasks, where a task-primitive's parameters are an action (e.g., dip, clean, dry) taken on an object (e.g., golf club) using a tool (e.g., pail of water, brush, towel); and context, used to determine missing and implied task-primitive parameter values, represented as a set of canonical parameter values, each with a confidence score based on how many times the value was detected in the video and audio information and how long ago it was detected.
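To make the task representation concrete, the following is a minimal, illustrative sketch, not the authors' implementation, of how parameterized task-primitives, macro-tasks, and a context store whose confidence grows with detection count and decays with recency might be expressed; all class names, method names, and the decay heuristic are hypothetical assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical sketch of the task representation described in the abstract.

@dataclass
class TaskPrimitive:
    action: str                    # e.g., "dip", "clean", "dry"
    obj: Optional[str] = None      # e.g., "golf club"; None if missing/implied
    tool: Optional[str] = None     # e.g., "pail of water", "brush", "towel"

@dataclass
class MacroTask:
    name: str
    # A macro-task is composed of task-primitives and other macro-tasks.
    steps: List[Union["MacroTask", TaskPrimitive]] = field(default_factory=list)

class Context:
    """Canonical parameter values with confidence scores that grow with
    repeated detections and decay as the last detection ages (illustrative)."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        # slot ("object" / "tool") -> value -> (detection count, last seen step)
        self.values: Dict[str, Dict[str, Tuple[int, int]]] = {}

    def observe(self, slot: str, value: str, step: int) -> None:
        count, _ = self.values.setdefault(slot, {}).get(value, (0, step))
        self.values[slot][value] = (count + 1, step)

    def best(self, slot: str, now: int) -> Optional[str]:
        candidates = self.values.get(slot, {})
        if not candidates:
            return None
        # Confidence rises with detection count and decays with elapsed time.
        def score(count: int, last_seen: int) -> float:
            return count * (self.decay ** (now - last_seen))
        return max(candidates, key=lambda v: score(*candidates[v]))

# Usage: fill in the implied object and tool for a later "dry" primitive.
ctx = Context()
ctx.observe("object", "golf club", step=1)
ctx.observe("tool", "towel", step=4)
dry = TaskPrimitive(action="dry", obj=ctx.best("object", now=5),
                    tool=ctx.best("tool", now=5))
clean_clubs = MacroTask(name="clean golf clubs", steps=[dry])
```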
