Abstract

A good task policy should explicitly interpret the preconditions of actions and the compositional structure of the task. We aim to automatically learn such a task policy from videos, which remains challenging at present. The problem is further aggravated when task-irrelevant components appear in the videos, such as unoperated objects and small actions: task-irrelevant objects may introduce disruptive visual relations, and task-irrelevant actions can lead to misleading or even failed task planning. Solving both issues simultaneously is beyond the scope of existing methods. To this end, we propose the Task Parse Tree (TPT), a novel task policy representation that distinguishes task-relevant actions with definite preconditions and a clear execution order. The automatic generation of a TPT relies on two core designs: a spatio-temporal graph (STG), which captures vital changes in the visual relations of objects and their attributes both spatially and temporally, and a conjugate action graph (CAG), which models the execution logic of actions in a graph. We collect a dataset of a real-world task, Make Tea, and experimental results on this dataset show that TPT achieves both accurate and interpretable task planning in two different scenarios.
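The TPT representation described above can be pictured as a tree whose leaves are precondition-gated actions and whose internal nodes impose an execution order on sub-steps. The sketch below is purely illustrative: the class name, field names, and the toy "Make Tea" fragment are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a Task Parse Tree (TPT) node. Field names are
# assumptions, not the paper's implementation. A node is either a
# composite step (ordered children) or a leaf action gated by preconditions.
@dataclass
class TPTNode:
    name: str
    preconditions: list          # e.g. required visual relations
    children: list = field(default_factory=list)  # ordered sub-steps

    def executable(self, state: set) -> bool:
        """A leaf action is executable once all its preconditions hold."""
        return all(p in state for p in self.preconditions)

    def plan(self, state: set) -> list:
        """Depth-first, left-to-right traversal yields the execution order."""
        if not self.children:
            return [self.name] if self.executable(state) else []
        order = []
        for child in self.children:
            order += child.plan(state)
        return order

# Toy "Make Tea" fragment (content invented for illustration).
boil = TPTNode("boil_water", ["kettle_filled"])
steep = TPTNode("steep_tea", ["water_boiled"])
task = TPTNode("make_tea", [], [boil, steep])
print(task.plan({"kettle_filled", "water_boiled"}))  # ['boil_water', 'steep_tea']
```

The point of the structure is that planning becomes a traversal: an action is emitted only when its preconditions are satisfied, which is how the representation filters out task-irrelevant actions.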

