MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Jingyuan Qi, Minqian Liu, Ying Shen, Zhiyang Xu, Lifu Huang

doi:10.1609/aaai.v38i17.29854

Abstract

Automatically generating scripts (i.e., sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial for modern AI virtual assistants that guide humans through everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images, or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge, MULTISCRIPT, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task; the expected output is (1) a sequence of structured step descriptions in text based on the demonstration video and (2) a single text description of the subsequent step, respectively. Built from WikiHow, MULTISCRIPT covers multimodal scripts in videos and text descriptions for over 6,655 everyday human tasks across 19 diverse domains. To establish baseline performance on MULTISCRIPT, we propose two knowledge-guided multimodal generative frameworks that incorporate task-related knowledge prompted from large language models such as Vicuna. Experimental results show that our proposed approaches significantly improve over competitive baselines.
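The input/output contract of the two tasks can be sketched as a small data structure. This is only an illustrative reading of the abstract: the class name, field names, helper functions, and the sample values are all hypothetical and are not taken from the MULTISCRIPT dataset or its released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptExample:
    """Hypothetical container for one benchmark instance (names are assumptions)."""
    task_name: str             # target task name given as input
    video_path: str            # placeholder path to the demonstration video
    observed_steps: List[str]  # reference step descriptions for what the video shows
    next_step: str             # reference description of the step that follows


def script_generation_target(example: ScriptExample) -> List[str]:
    """Task (1): the expected output is the sequence of step descriptions."""
    return list(example.observed_steps)


def next_step_target(example: ScriptExample) -> str:
    """Task (2): the expected output is a single description of the next step."""
    return example.next_step


# Made-up sample values for illustration only; not drawn from the dataset.
example = ScriptExample(
    task_name="Make a paper airplane",
    video_path="videos/example.mp4",
    observed_steps=["Fold the paper in half.", "Fold the top corners to the center."],
    next_step="Fold the wings down.",
)
```

Both tasks share the same input (task name plus video) and differ only in the target, which is why a single record type with two target accessors is enough for this sketch.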

More From: Proceedings of the AAAI Conference on Artificial Intelligence

