Abstract

Given the growth of scientific literature on the web, particularly in materials science, accurately acquiring data from the literature has become increasingly important. Material (or chemical) information systems play an essential role in discovering data, materials, and synthesis processes from the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content, that is, complete, meaningful sentences drawn from large chunks of text. The process of detecting the beginning and end of a sentence and extracting it as a correct sentence is called sentence boundary extraction. Accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. This study therefore provides a comparative analysis of tools for converting PDF documents into text that are available as Python libraries or packages and are widely used by the research community. The main objective is to find the most suitable technique among the available ones for correctly extracting sentences from PDF files as text. The performance of the techniques evaluated (PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and GROBID) is presented in terms of precision, recall, F1 score, run time, and memory consumption. The Natural Language Processing (NLP) tools NLTK, spaCy, and Gensim are used to identify sentence boundaries. Of all the techniques studied, the GROBID PDF extraction package combined with the NLP tool spaCy achieved the highest F1 score, 93%, while consuming the least memory, 46.13 MB.
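The precision/recall/F1 evaluation described above can be sketched by treating extraction as a set-overlap problem between extracted sentences and a reference set. This is a minimal illustration, not the study's actual scoring code; the function name and the sample inputs are invented for demonstration.

```python
def sentence_f1(extracted, reference):
    """Score a list of extracted sentences against a reference list.

    A sentence counts as correct only if it matches a reference
    sentence exactly, i.e. its boundaries were detected properly.
    """
    extracted, reference = set(extracted), set(reference)
    true_pos = len(extracted & reference)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, an extractor that recovers two of three reference sentences with no spurious output scores precision 1.0, recall 2/3, and F1 0.8.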

Highlights

  • Natural Language Understanding (NLU) is a subfield of natural language processing (NLP) that focuses on more specific NLP tasks, such as semantic parsing, relation extraction, sentiment analysis, dialogue agents, paraphrasing, natural language interfaces, question answering, and summarization [1]

  • A dataset consisting of 10 scientific articles on the topic of Electric Double Layer Capacitors from different journal publishers is used

  • All the articles are downloaded in Portable Document Format (PDF) and converted into text using various popular Python PDF packages, namely PyPDF2, PyMuPDF, pdfminer.six, pdftotext, Tika, and GROBID


Introduction

Natural Language Understanding (NLU) is a subfield of natural language processing (NLP) that focuses on more specific NLP tasks, such as semantic parsing, relation extraction, sentiment analysis, dialogue agents, paraphrasing, natural language interfaces, question answering, and summarization [1]. Material informatics has been applied in various areas of materials science research, among them the automatic discovery of materials from scientific documents, the automatic discovery of synthesis processes, and the extraction of material properties and values from scientific documents [2]. All of this research is based on NLP and NLU. Various studies have been carried out on extracting text from PDFs, but they did not measure how much of the text is extracted as complete sentences, nor did they suggest a specific method or combination of techniques for obtaining text data from PDF documents with proper sentence boundaries.
