Abstract

As a knowledge carrier, the diagram is widely distributed in many aspects of human life, such as textbooks, architectural drawings, and documents. Different from natural images, representations of visual elements in the diagram are sparser, and similar visual representations can reflect dissimilar semantics. Thus, current methods fail to capture the visual elements with precise semantics. To address this issue, regarding the aligned visual and textual elements as pairs is the way to assign the precise semantics of textual elements to visual elements. We build the first diagram dataset named align diagram element (ADE), which includes annotations for alignment relations between visual and textual elements. And we propose a visual-textual alignment model (VTAM) including graph construction and optimal aligning phases. In the graph construction phase, the relational graphs are constructed between different elements with four relational operators. The relational operators are designed to measure the relations between different elements, according to distance, connection line, inclusion, and feature similarity. In the optimal aligning phase, the representation at each visual-textual pair is improved as a weighted sum of the representations on all relational graphs. Experimental results show that our VTAM achieves a significant improvement of 10.9% on mean test folds of the ADE dataset than the current best competitor. In order to explore the role of alignment relations in diagram parsing, we introduce VTAM to diagram-related tasks, such as diagram question answering (DQA). And we achieve 2.8% to 5.9% and 4.6% to 5.1% improvements on AI2D and Foodwebs after adding VTAM. Our dataset and code are released at: https://github.com/ADE-dataset/ADE-dataset.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.