Abstract

As a fundamental step in document-related tasks, document classification is widely used in document image processing applications. Unlike general image classification in computer vision, text document images contain both visual cues and the corresponding text within the image. However, how to bridge these two modalities and leverage both textual and visual features to classify text document images remains challenging. In this paper, we present a cross-modal deep network that captures both the textual content and the visual information in document images, using NASNet-Large to extract image features and BERT to extract text features. Thanks to the joint learning of text and image features, the proposed cross-modal approach outperforms state-of-the-art single-modal methods. Experimental results demonstrate that it achieves new state-of-the-art results for text document image classification on the benchmark Tobacco-3482 dataset, outperforming the current best method by 3.91% in classification accuracy.
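A minimal sketch of the cross-modal idea described above, assuming a PyTorch setup with the `timm` and `transformers` libraries. The fusion strategy (feature concatenation followed by a linear classifier), the checkpoint names, and the class count are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import timm
from transformers import BertModel


class CrossModalDocClassifier(nn.Module):
    """Sketch: NASNet-Large image features + BERT text features, fused by concatenation."""

    def __init__(self, num_classes: int = 10):  # Tobacco-3482 has 10 document classes
        super().__init__()
        # Visual branch: NASNet-Large backbone; num_classes=0 yields pooled features.
        self.image_encoder = timm.create_model(
            "nasnetalarge", pretrained=True, num_classes=0
        )  # produces a 4032-dim feature vector per image
        # Textual branch: BERT encoder over the document's (e.g., OCR-extracted) text.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = (
            self.image_encoder.num_features + self.text_encoder.config.hidden_size
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.image_encoder(images)  # (B, 4032)
        txt_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        txt_feat = txt_out.last_hidden_state[:, 0]  # [CLS] token embedding, (B, 768)
        fused = torch.cat([img_feat, txt_feat], dim=-1)  # simple late fusion
        return self.classifier(fused)  # (B, num_classes) logits
```

Concatenation is only one plausible fusion scheme; the same two-branch layout also admits weighted averaging of per-branch logits or a small MLP over the fused vector.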
