Abstract

Procedural text generation from visual observations of instructional videos, such as assembly, biochemical experiments, and cooking, is an essential task for scene understanding and real-world applications. It differs from general captioning tasks in two ways: instructional steps follow a flow of material combination, and materials change their state through action-involved manipulations. However, existing works do not adequately address both issues. To this end, this paper proposes a procedural text generation framework, namely XCL4PTG, with a Video Frame-wise eXplanation driven Contrastive Learning (VFXCL) module and an Action Fused Material Representation Learning (AFMRL) module, which generates procedural text from a step's frame sequence in an instructional video. VFXCL applies an explanation method to estimate each frame's importance within a step's frame sequence and derives positive and negative sequences for self-supervised contrastive learning, enhancing step representation learning so that inter-step differences are captured. AFMRL leverages identified actions and materials to update material states after manipulations, contributing to step representation learning via intra-step, action-fused material state tracking. The two modules collaboratively extract the information essential for the decoder to generate procedural text accurately. Experimental results show the effectiveness of the proposed framework, which outperforms state-of-the-art video procedural text generation models.
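
The sketch below is only a rough illustration of the explanation-driven contrastive idea summarized above: per-frame importance scores (here random stand-ins for explanation outputs) are used to build a positive view from the most important frames and a negative view from the least important ones, and an InfoNCE-style loss contrasts the anchor step embedding against the two views. The function names, the keep-ratio masking, and the mean-pooled step encoder are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of explanation-driven contrastive learning over a step's
# frame sequence; not the authors' VFXCL implementation.
import torch
import torch.nn.functional as F


def build_views(frames: torch.Tensor, importance: torch.Tensor, keep_ratio: float = 0.5):
    """Split a (T, D) frame sequence into a positive view (most important frames)
    and a negative view (least important frames), using explanation scores."""
    T = frames.size(0)
    k = max(1, int(T * keep_ratio))
    order = importance.argsort(descending=True)
    pos = frames[order[:k]]   # frames the explanation method deems important
    neg = frames[order[-k:]]  # frames it deems unimportant
    return pos, neg


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, negative: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss pulling the anchor step embedding toward the positive
    view and pushing it away from the negative view."""
    anchor, positive, negative = (F.normalize(z, dim=-1) for z in (anchor, positive, negative))
    logits = torch.stack([anchor @ positive, anchor @ negative]) / temperature
    target = torch.zeros(1, dtype=torch.long)  # index 0 marks the positive pair
    return F.cross_entropy(logits.unsqueeze(0), target)


# Toy usage: mean pooling stands in for the step encoder.
frames = torch.randn(12, 256)   # 12 frames, 256-d features
importance = torch.rand(12)     # per-frame explanation scores (placeholder)
pos_view, neg_view = build_views(frames, importance)
loss = info_nce(frames.mean(0), pos_view.mean(0), neg_view.mean(0))
print(loss.item())
```

In practice the anchor and view embeddings would come from the framework's step encoder rather than mean pooling, and negatives could also be drawn from other steps to sharpen inter-step differences.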
