Abstract
Extracting information from documents containing quantitative data in tabular format is an important but still unsolved task due to the heterogeneity of document layouts. This work takes a step toward solving this problem. We propose a flexible, hybrid table extraction system consisting of a deep learning-based table detection module, a heuristic-based structure recognition method, and a graph-based semantic interpretation component. The proposed system is modular and supports the most frequent table layouts. Moreover, it handles both documents in image format and PDF files with embedded text. The proposed system outperforms the baseline method and achieves results on par with state-of-the-art approaches on the challenging benchmarks from the ICDAR 2013 and ICDAR 2019 table interpretation competitions. Furthermore, we correct an issue with the evaluation script used in the latter competition and report extended results of the proposed method in comparison with a leading commercial product. Finally, our table extraction system achieves a high F_1 score in the scenario where raw documents are given as input and the targeted information is contained in a subset of table columns. The presented system achieves results competitive with leading methods in the field and has already been evaluated on general-purpose and biomedical benchmarks. We intend to continuously improve our approach and to process data from other domains, e.g., financial documents. To support future research on information extraction from documents, we make the evaluation scripts and results from our experiments publicly available at https://github.com/mnamysl/tabrec-sncs.
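For illustration only, the sketch below shows one way the three-stage pipeline summarized above (table detection, structure recognition, semantic interpretation) could be composed as interchangeable modules. All class and method names are hypothetical placeholders and do not correspond to the authors' actual implementation or API.

```python
# Illustrative sketch (not the authors' code): composing the three pipeline
# stages described in the abstract so that each module can be swapped
# independently, e.g., a learned detector with a heuristic structure recognizer.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableRegion:
    """Bounding box of a detected table on a page (pixel coordinates)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class TableStructure:
    """Rows of cell texts recovered by the structure recognition step."""
    region: TableRegion
    rows: List[List[str]] = field(default_factory=list)


class TableExtractionPipeline:
    """Chains detection, structure recognition, and semantic interpretation."""

    def __init__(self, detector, structure_recognizer, interpreter):
        self.detector = detector                            # e.g., deep learning model
        self.structure_recognizer = structure_recognizer    # e.g., heuristic rules
        self.interpreter = interpreter                       # e.g., graph-based matching

    def run(self, document) -> List[Dict[str, List[str]]]:
        results = []
        for region in self.detector.detect(document):                # 1. table detection
            structure = self.structure_recognizer.recognize(document, region)  # 2. structure
            results.append(self.interpreter.interpret(structure))    # 3. semantic interpretation
        return results
```

The design choice implied by such a composition is that each stage exposes a narrow interface, so the system can process both scanned images and born-digital PDFs by swapping only the components that depend on the input format.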