Automatically extracting content and style formatting from paged documents is challenging. It requires converting the original document to a level of granularity at which every section and piece of content is identifiable. No such tool yet exists for academic research documents. In this paper, we present an automated process for parsing research documents into XML files using a formal-methods approach based on context-free grammars (CFGs) and regular expressions (REGEXs) defined over a standard template. We implemented these algorithms in a tool, the research_XML (RX) parser, which parses such documents into tree-like structures stored as XML files. The RX tool extracts both the syntactic structure and the semantic information of a document's contents into XML files. These XML output files are lightweight, analyzable, queryable, and web-interoperable. The RX tool achieved a success rate of 91% when evaluated on fifty varied research documents averaging 160 pages each (8,004 pages in total). The tool and test data are available in a GitHub repository. The novelty of our process lies in applying formal techniques to information extraction from structured, multipage documents, in particular academic research documents, thereby advancing research in automatic information extraction.
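To make the general idea concrete, the following is a minimal Python sketch of combining a regular expression with a simple grammar to turn document text into an XML tree. It is an illustration only, not the RX parser itself: the heading pattern `HEADING`, the toy grammar, the function name `parse_document`, and the element names `document`, `section`, and `paragraph` are all assumptions introduced here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heading pattern: "1 Introduction", "2.1 Related Work", ...
# (an assumption for this sketch, not the template used by the RX tool)
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].*)$")


def parse_document(text: str) -> ET.Element:
    """Parse plain text under a toy grammar:

        document -> section*
        section  -> HEADING paragraph* section*

    Nesting is decided by the depth of the heading number (1, 1.1, 1.1.1).
    """
    root = ET.Element("document")
    stack = [(0, root)]  # (depth, element); the root sits at depth 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            number, title = match.group(1), match.group(2)
            depth = number.count(".") + 1
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= depth:
                stack.pop()
            section = ET.SubElement(
                stack[-1][1], "section", number=number, title=title
            )
            stack.append((depth, section))
        else:
            # Any non-heading line becomes a paragraph of the open section.
            ET.SubElement(stack[-1][1], "paragraph").text = line
    return root


sample = (
    "1 Introduction\n"
    "Parsing is hard.\n"
    "1.1 Scope\n"
    "We target theses.\n"
    "2 Method\n"
    "We use grammars.\n"
)
tree = parse_document(sample)
xml_text = ET.tostring(tree, encoding="unicode")
```

The resulting tree can be queried with ordinary XML tooling, for example `tree.find("section/section/paragraph")`, which is the kind of queryability the XML output is meant to provide.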