Abstract
To address the heavy invoice-entry workload in the financial industry and the low accuracy of traditional manual entry, we propose SGFNet, a financial invoice information extraction network that integrates semantic graph associations with multimodal modeling. First, we build a graph of strong and weak semantic associations among the data within each modality, based on the correlation of their text content. Next, we model the multimodal data in a unified structure, extracting the textual information of each invoice together with the corresponding image and layout information, and use the semantic associations in the graph to guide multimodal fusion and embedding, yielding a richer feature representation. The semantically linked multimodal information is then fed into an aggregated multimodal self-attention mechanism to establish effective connections between modalities. Finally, the loss function combines supervised contrastive learning with a smoothed Kullback–Leibler divergence, mitigating the accuracy degradation caused by sample imbalance and unstable convergence. In our experiments, the model achieves F1 scores of 93.71% on the English financial invoice dataset and 96.27% on the Chinese dataset, indicating that the proposed method effectively extracts features from the different data modalities and delivers satisfactory information extraction results.
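The following is a minimal sketch, not the authors' implementation, of how a supervised contrastive term and a label-smoothed Kullback–Leibler divergence term might be combined into a single training loss as described above. The function names, temperature, smoothing factor, and weighting coefficient `alpha` are illustrative assumptions; the exact formulation used in SGFNet is not given in the abstract.

```python
# Illustrative sketch: supervised contrastive loss + label-smoothed KL divergence.
# All hyperparameters here are assumptions, not values from the paper.
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss over L2-normalized features (Khosla et al. style)."""
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature            # pairwise similarities
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(eye, float("-inf"))             # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: samples sharing the same label, excluding the anchor itself.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    return -pos_log_prob.sum(dim=1).div(pos_count).mean()


def smoothed_kl_loss(logits, labels, num_classes, smoothing=0.1):
    """KL divergence between predictions and a label-smoothed target distribution."""
    target = torch.full_like(logits, smoothing / (num_classes - 1))
    target.scatter_(1, labels.unsqueeze(1), 1.0 - smoothing)
    return F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")


def total_loss(features, logits, labels, num_classes, alpha=0.5):
    # alpha balances the two terms; the actual weighting is an assumption here.
    return (alpha * supervised_contrastive_loss(features, labels)
            + (1 - alpha) * smoothed_kl_loss(logits, labels, num_classes))
```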