Abstract
Background
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
Results
Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 1. Additionally, we compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central, and then compared this accuracy with that of the text extracted by the PDF2Text system, which is commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.
Conclusions
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
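The three-stage process described in the abstract can be illustrated with a minimal sketch. This is not the LA-PDFText implementation: the `Block` type, the `SECTION_HEADINGS` set, and the heading-propagation rule are all assumptions made for illustration, and stage 1 (spatial block detection) is taken as already done, so the blocks arrive with a page, a column index and a vertical position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A contiguous text block, assumed to come from stage 1 (block detection)."""
    text: str
    page: int
    column: int  # 0 = left column, 1 = right column (hypothetical layout feature)
    y: float     # distance from the top of the page; smaller means higher up


# Hypothetical rule set: a block whose first line is one of these starts a section.
SECTION_HEADINGS = {"abstract", "introduction", "methods",
                    "results", "discussion", "references"}


def reading_order(blocks):
    """Order blocks by page, then column, then vertical position."""
    return sorted(blocks, key=lambda b: (b.page, b.column, b.y))


def classify(blocks):
    """Stage 2 (toy rule): label each block with the most recent section heading."""
    labelled = []
    current = "front_matter"
    for block in reading_order(blocks):
        stripped = block.text.strip()
        first_line = stripped.splitlines()[0].lower() if stripped else ""
        if first_line in SECTION_HEADINGS:
            current = first_line
        labelled.append((current, block))
    return labelled


def stitch(labelled):
    """Stage 3: join the text of blocks that share a label, keeping their order."""
    sections = {}
    for label, block in labelled:
        sections.setdefault(label, []).append(block.text)
    return {label: "\n".join(parts) for label, parts in sections.items()}
```

Calling `stitch(classify(blocks))` on a list of `Block` objects returns a dictionary mapping each section label to its text, with a right-column block appended after the left-column blocks of the same page. A real rule set would draw on layout features such as font size and position rather than heading keywords alone.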
Highlights
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles
A key consideration is that the well-crafted manual workflows developed by expert curators in biomedical databases typically use rules based on context-dependent and rhetorical-structure-dependent clues found only in the full text of an article. It is important for the developers of Biomedical Natural Language Processing (BioNLP) applications to have access to an accurate representation of the full text of papers derived from Portable Document Format (PDF) files (see [16])
Our goal is to provide an open-source software mechanism for automated decomposition and conversion of PDF files of research articles into a simple text format that other NLP groups can incorporate into their toolsets
Summary
We have evaluated the three steps of our system independently of each other. We will present our evaluation methods for each of the three steps of LA-PDFText and their results. We describe edit operations applied to the manually segmented page and their corresponding cost. The results of this evaluation are presented in Additional file 4: Tables S4, S5, S6 and S7 under the column titled ‘Spatial Segmentation Score’.
Step 2 - Classifying text blocks into rhetorical categories - evaluation
The rule-based segment classifier component of our software is instrumented to produce color-coded segments depending upon the type of section to which each segment belongs. This color-coding is used in the manual evaluation to count the number of segments of each section that were correctly classified (true positives; TP), those that were incorrectly classified (false positives; FP) and those that were missed by the rule engine (false negatives; FN).
The interruption is precisely the sort of error that is unacceptable in many applications of BioNLP, especially those
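The TP, FP and FN counts from the Step 2 evaluation map onto precision, recall and F1 through the standard definitions. The sketch below shows that arithmetic. The function name is our own, and the counts in the usage note are invented for illustration, not taken from the evaluation.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from segment counts.

    tp: segments correctly assigned to a section
    fp: segments incorrectly assigned to that section
    fn: segments of that section missed by the rule engine
    A ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

For example, with illustrative counts of 8 true positives, 2 false positives and 2 false negatives, `precision_recall_f1(8, 2, 2)` gives a precision, recall and F1 of 0.8 each.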