Abstract

Document key information extraction (DKIE) aims to automatically understand documents with complex formats and layouts (invoices, business insurance, etc.). While pre-trained approaches have shown high performance on many DKIE tasks, they face three major challenges. First, they ignore the ambiguity that arises from similar text representations before cross-modal interaction. Second, they do not align cross-modal representations before cross-modal interaction. Finally, the self-attention layers used in cross-modal interaction incur significant computing costs, making it hard to perform joint representation learning over all negative samples. To address these issues, we propose a Dynamical Cross-Modal Alignment Interaction framework (DCMAI). Specifically, (1) a prior knowledge-guided module is designed to adaptively mine fine-grained visual information and disambiguate similar text representations; (2) a crossover alignment loss is formulated to align cross-modal representations before cross-modal interaction; and (3) a hierarchical interaction sampling scheme is introduced to obtain a small but efficient subset of cross-modal negative samples, and a contrastive loss is employed to improve joint representation learning. Comprehensive experiments show that the proposed DCMAI achieves state-of-the-art performance, outperforming competitive baselines on several public downstream benchmarks. The code will be made publicly available.
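
The abstract does not give the form of the crossover alignment loss or the hierarchical sampling scheme, so the following is only a generic illustration of the underlying idea: a symmetric InfoNCE-style contrastive loss between paired text and visual embeddings, with an optional restriction to the hardest negatives as a stand-in for sampling a small subset of negatives. The function name `info_nce`, the `temperature` and `num_negatives` parameters, and the top-k selection rule are all assumptions made for this sketch, not DCMAI's actual design.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(text_emb, vis_emb, temperature=0.1, num_negatives=None):
    """Symmetric contrastive loss over paired text/visual embeddings.

    Row i of text_emb is paired with row i of vis_emb; all other rows
    act as negatives. If num_negatives is set, only that many of the
    most similar (hardest) negatives are kept per anchor. This top-k
    rule is a hypothetical stand-in for a negative-sampling scheme.
    """
    t = [normalize(x) for x in text_emb]
    v = [normalize(x) for x in vis_emb]
    n = len(t)

    def one_direction(anchors, candidates):
        total = 0.0
        for i in range(n):
            pos = dot(anchors[i], candidates[i]) / temperature
            negs = [
                dot(anchors[i], candidates[j]) / temperature
                for j in range(n)
                if j != i
            ]
            if num_negatives is not None:
                negs = sorted(negs, reverse=True)[:num_negatives]
            logits = [pos] + negs
            # Log-sum-exp with max subtraction for numerical stability.
            m = max(logits)
            log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_denom - pos
        return total / n

    # Average the text-to-visual and visual-to-text directions.
    return 0.5 * (one_direction(t, v) + one_direction(v, t))
```

In this sketch, well-aligned pairs yield a loss near zero while mismatched pairs are penalized, and limiting `num_negatives` reduces the number of similarity terms evaluated per anchor at the cost of a looser objective.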
