Abstract

Document layout analysis plays an important role in document understanding: it identifies and classifies the different components of digital documents. Currently, no universal algorithm fits all types of digital documents. This work presents a novel approach for identifying tables, figures, isolated equations, and text regions in scientific papers using deep learning and computer vision techniques. The proposed approach is a three-stage system: (i) obtaining the spectrograms of the horizontal and vertical intensity histograms of segmented regions of interest; (ii) labeling each segmented region of interest as text, table, or figure using a deep convolutional neural network classifier; and (iii) identifying isolated equations in text regions using Bag of Visual Words (BOVW) with Zernike moments. We built a new dataset of 11007 papers for the experiments and evaluated our model with two common segmentation metrics: (1) Adjusted Rand Index (ARI) and (2) Variation of Information (VI). The proposed document layout analysis system reached an overall accuracy of 96.2685%, outperforming prior art at a lower computational cost.
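The two segmentation metrics named in the abstract have standard definitions. The following is a minimal sketch of those definitions in plain Python, assuming each segmentation is given as a flat list of labels (for example, one label per pixel or per region). How the authors map document regions to labels is not described in this abstract, and the function names are illustrative.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    pairs = Counter(zip(labels_a, labels_b))  # contingency table n_ij
    rows = Counter(labels_a)                  # row sums a_i
    cols = Counter(labels_b)                  # column sums b_j
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)     # index expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form one cluster
    return (sum_ij - expected) / (max_index - expected)


def variation_of_information(labels_a, labels_b):
    """Variation of Information, H(A|B) + H(B|A), in nats."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    vi = 0.0
    for (a, b), n_ab in pairs.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(p_ab / (rows[a] / n)) + log(p_ab / (cols[b] / n)))
    return vi


# Toy labelings (hypothetical data, not from the paper).
truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]  # same partition, different label names
crossed = [0, 1, 0, 1]    # partition that cuts across the first one

ari_same = adjusted_rand_index(truth, relabeled)
vi_same = variation_of_information(truth, relabeled)
ari_crossed = adjusted_rand_index(truth, crossed)
vi_crossed = variation_of_information(truth, crossed)
```

ARI equals 1 for identical partitions and is near 0 for chance-level agreement, while VI equals 0 for identical partitions and grows as the two partitions diverge.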

Highlights

  • Document layout analysis (DLA) [1] is still one of the most challenging areas of information retrieval [2], owing to the wide variety of documents that can be authored and the lack of structured information [3] in standardized exchange formats such as the Portable Document Format (PDF)

  • We present a new approach for DLA, which consists of: 1) Segmentation of the regions of interest; 2) Generation of spectrograms from the horizontal and vertical pixel projection profiles of the regions of interest; 3) Implementation of a deep Convolutional Neural Network (CNN), trained for three classes: text, table, and figure; and 4) Use of the Bag of Visual Words (BOVW) technique to identify lines with isolated equations within the text regions

  • Sparse Ratio: In [17] and [15] it was found that lines with isolated equations produce a higher sparse ratio than lines without isolated equations. In addition to these three features, we propose a new set of features based on a Bag of Visual Words (BOVW) of each symbol contained in a line
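Step 2 of the approach listed above turns the projection profiles of a region into spectrograms. The following is a minimal sketch of that idea in plain Python, assuming a region is a 2-D list of pixel intensities and that a spectrogram is the magnitude of a short-time discrete Fourier transform with a Hann window. The function names, the window length, and the hop size are illustrative assumptions, not values taken from the paper.

```python
import cmath
import math


def projection_profiles(image):
    """Return (horizontal, vertical) intensity profiles of a 2-D image.

    horizontal[r] is the sum of row r; vertical[c] is the sum of column c.
    """
    horizontal = [sum(row) for row in image]
    vertical = [sum(col) for col in zip(*image)]
    return horizontal, vertical


def spectrogram(signal, window=8, hop=4):
    """Short-time DFT magnitudes: one list of window // 2 + 1 bins per frame.

    Assumes window >= 2; signals shorter than the window yield no frames.
    """
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (window - 1))
            for i in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = [signal[start + i] * hann[i] for i in range(window)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / window)
                    for i, x in enumerate(chunk)))
            for k in range(window // 2 + 1)
        ])
    return frames


# Toy 4 x 16 "region" (hypothetical data): ink only on odd rows,
# in column stripes two pixels wide.
region = [[(r % 2) * ((c // 2) % 2) for c in range(16)] for r in range(4)]

horizontal, vertical = projection_profiles(region)
vertical_spec = spectrogram(vertical, window=8, hop=4)
```

Each row of `vertical_spec` describes the frequency content of one stretch of the profile, so the resulting 2-D array can be treated as an image and passed to a classifier.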

Summary

Introduction

Document layout analysis (DLA) [1] is still one of the most challenging areas of information retrieval [2], owing to the wide variety of documents that can be authored and the lack of structured information [3] in standardized exchange formats such as the Portable Document Format (PDF). The main feature of digitally created PDFs is that they preserve the visual structure of the document on any electronic device, which has made PDF the current standard format for electronic document exchange [4]. No universal algorithm [5] fully understands all regions of a digital document, i.e., identifies and segments all of its individual elements, such as tables, graphs, inline and isolated equations, and paragraphs. Approaches to identifying and classifying the elements of a digital document based on its visual structure can be grouped into three categories [6], according to the pixels they analyze: (i) foreground regions, (ii) background regions, and (iii) both foreground and background regions. Foreground-based approaches perform page segmentation by analyzing the foreground pixels, which are normally the text characters. Approaches that analyze both foreground and background pixels try to combine the results of the two individual approaches.
