Document layout analysis (DLA) is the task of locating and classifying layout elements in a document, such as Table, Figure, List, and Text. While deep-learning-based computer vision methods perform well at detecting Text and Figure blocks, they still struggle to accurately recognize List, Title, and Table blocks when training data is limited. To address this issue, we propose YOLOLayout, a single-stage DLA model that incorporates a Multi-Scale Shallow Visual Feature Enhancement Module (MS-SVFEM) and a Multi-Scale Cross-Feature Fusion Module (MS-CFF). The MS-SVFEM extracts multi-scale spatial information through a channel attention module, a spatial attention module, and multi-branch convolution. The MS-CFF fuses features from different levels through an attention mechanism. Experiments show that YOLOLayout improves mAP over the baseline model by 2.2% on the PubLayNet dataset and 1.5% on the ISCAS-CLAD dataset.
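The abstract does not specify the internals of the attention modules, but the sequential channel-then-spatial attention it describes can be sketched generically (CBAM-style) as follows. This is a minimal NumPy illustration, not the paper's implementation: all weights, shapes, and the 1x1 simplification of the spatial convolution are assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    # x: (C, H, W). Channel descriptors from global average and max pooling
    # pass through a shared two-layer MLP (w1, w2); sigmoid gives per-channel
    # weights in (0, 1) that rescale the feature map.
    avg = x.mean(axis=(1, 2))                                  # (C,)
    mx = x.max(axis=(1, 2))                                    # (C,)
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                  + w2 @ np.maximum(w1 @ mx, 0.0))             # (C,)
    return x * att[:, None, None]

def spatial_attention(x, w_avg, w_max, b):
    # Per-pixel weights from channel-wise average and max maps. A real module
    # would use a 7x7 convolution here; a weighted sum (1x1-conv equivalent)
    # keeps the sketch dependency-free.
    avg = x.mean(axis=0)                                       # (H, W)
    mx = x.max(axis=0)                                         # (H, W)
    att = sigmoid(w_avg * avg + w_max * mx + b)                # (H, W)
    return x * att[None, :, :]

# Toy feature map and randomly initialized (hypothetical) weights.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                                        # r: reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = spatial_attention(channel_attention(x, w1, w2), 0.5, 0.5, 0.0)
print(y.shape)  # attention preserves the feature-map shape: (8, 4, 4)
```

Because both attention maps lie in (0, 1), the output is an element-wise damping of the input: the modules reweight features rather than change the tensor shape, which is what lets them slot into a single-stage detector's backbone.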