Layout-aware text extraction from full-text PDF of scientific articles

Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns

doi:10.1186/1751-0473-7-7

Open Access

https://doi.org/10.1186/1751-0473-7-7

Abstract

Background: The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results: Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96%, Recall = 0.89% and F1 = 0.91%. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.

Conclusions: LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
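The three-stage process described in the abstract can be illustrated with a small sketch. This is not the LA-PDFText implementation: the `Word` and `Block` types, the vertical-gap threshold, the `KNOWN_HEADINGS` set and every function name below are hypothetical, and the toy layout step ignores columns, fonts and other cues that a real layout-aware system would need.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical heading vocabulary; a real rule set would be far richer.
KNOWN_HEADINGS = {"abstract", "background", "methods", "results", "discussion"}


@dataclass
class Word:
    text: str
    page: int
    x: float       # left edge of the word
    y: float       # top edge, increasing down the page
    height: float  # glyph height, used to scale the gap threshold


@dataclass
class Block:
    page: int
    words: List[str] = field(default_factory=list)
    section: str = "front-matter"
    is_heading: bool = False

    @property
    def text(self) -> str:
        return " ".join(self.words)


def detect_blocks(words: List[Word], gap_factor: float = 1.5) -> List[Block]:
    """Stage 1 (toy): open a new block on a page change or a large vertical gap."""
    blocks: List[Block] = []
    prev: Optional[Word] = None
    for w in sorted(words, key=lambda w: (w.page, w.y, w.x)):
        starts_block = (
            prev is None
            or w.page != prev.page
            or (w.y - prev.y) > gap_factor * prev.height
        )
        if starts_block:
            blocks.append(Block(page=w.page))
        blocks[-1].words.append(w.text)
        prev = w
    return blocks


def classify_blocks(blocks: List[Block]) -> List[Block]:
    """Stage 2 (toy): a known heading sets the section for the blocks after it."""
    current = "front-matter"
    for b in blocks:
        key = b.text.strip().lower()
        if key in KNOWN_HEADINGS:
            current = key
            b.is_heading = True
        b.section = current
    return blocks


def stitch(blocks: List[Block]) -> Dict[str, str]:
    """Stage 3 (toy): join non-heading blocks per section, in reading order."""
    sections: Dict[str, List[str]] = {}
    for b in blocks:
        if not b.is_heading:
            sections.setdefault(b.section, []).append(b.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


def extract_sections(words: List[Word]) -> Dict[str, str]:
    return stitch(classify_blocks(detect_blocks(words)))


if __name__ == "__main__":
    demo = [
        Word(text="Abstract", page=0, x=0, y=10, height=10),
        Word(text="PDF", page=0, x=0, y=40, height=10),
        Word(text="text", page=0, x=30, y=40, height=10),
        Word(text="Results", page=0, x=0, y=80, height=10),
        Word(text="F1", page=0, x=0, y=110, height=10),
    ]
    print(extract_sections(demo))
```

Keeping the stages as separate functions mirrors the way the paper evaluates each step independently: the output of block detection can be inspected before any classification rule is applied.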

Highlights

LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles.
A key consideration is that the well-crafted manual workflows, developed by expert curators in biomedical databases, typically use rules based on context and rhetorical structure-dependent clues found only in the full text of an article. It is important for the developers of Biomedical Natural Language Processing (BioNLP) applications to have access to an accurate representation of the full text of papers derived from Portable Document Format (PDF) files; see [16].
Our goal is to provide an open-source software mechanism for automated decomposition and conversion of PDF files of research articles into a simple text format that other NLP groups can incorporate into their toolsets.

Summary

Results

We have evaluated the three steps of our system independently of each other. We will present our evaluation methods for each of the three steps of LA-PDFText and their results. We describe edit operations applied to the manually segmented page and their corresponding cost. The results of this evaluation are presented in Additional file 4: Tables S4, S5, S6 and S7 under the column titled ‘Spatial Segmentation Score’.

Step 2 - Classifying text blocks into rhetorical categories - evaluation

The rule-based segment classifier component of our software is instrumented to produce color-coded segments depending upon the type of section to which each segment belongs. This color-coding is used in the manual evaluation to count the number of segments of each section that were correctly classified (true positives; TP), those that were incorrectly classified (false positives; FP) and those that were missed by the rule engine (false negatives; FN). The interruption is precisely the sort of error that is unacceptable in many applications of BioNLP, especially those
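The TP, FP and FN counts described for the Step 2 evaluation map onto precision, recall and F1 through the standard definitions. The sketch below shows that arithmetic only; the function name and the example counts are hypothetical and are not taken from the paper's tables, and how the paper aggregates scores across sections is not stated here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; each ratio falls back to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for one section type, for illustration only.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Computing the three values per section type and then averaging gives a macro-averaged score, whereas summing the counts first gives a micro-averaged one; the two can differ noticeably when section types vary in frequency.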

Journal: Source Code for Biology and Medicine	Publication Date: May 28, 2012
Citations: 111	License type: CC BY 2.0
