Abstract

Three natural language processing (NLP) automated summarization techniques were tested on a special collection of Catholic Pamphlets acquired by Hesburgh Libraries. The pamphlets, supplied as .pdf files, were fed into an OCR pipeline, and extensive data cleaning and text preprocessing were necessary before the summarization algorithms could be run. Under the standard ROUGE F1 scoring technique, the BERT Extractive Summarizer most closely matched the human reference summaries, with an average ROUGE F1 score of 0.239. The Gensim Python package implementation of TextRank scored 0.151, and a hand-implemented TextRank algorithm scored 0.144. This article covers the implementation of automated pipelines to read PDF text, the strengths and weaknesses of automated summarization techniques, and what the successes and failures of these summaries mean for their potential use in Hesburgh Libraries.
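To make the scores above easier to interpret, the following is a minimal sketch of how a ROUGE F1 score can be computed for one candidate summary against one human reference. It assumes the unigram variant (ROUGE-1) and a simple lowercase word tokenizer. The abstract does not state which ROUGE variant or tokenizer was used, so the function names `tokenize` and `rouge1_f1` and the example sentences are illustrative only, not the authors' implementation.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """Return the unigram-overlap (ROUGE-1) F1 score of a candidate against a reference."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: a word is credited only as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of candidate words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example texts, not taken from the pamphlet collection.
machine_summary = "the church teaches charity"
human_summary = "the church teaches faith and charity"
print(round(rouge1_f1(machine_summary, human_summary), 3))
```

In this sketch a score of 1.0 means the two summaries use exactly the same words with the same frequencies, and 0.0 means they share no words. An average such as 0.239 would then indicate partial word overlap with the human references.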


Summary

Introduction

Of all the information fields available in the MARC catalog, the summary field ranks near the top. In importance to patrons, it is behind only the author and title fields (Lundgren and Simpson 1999). Internal analysis (Unpublished Data, Kasten & Flannery, 2020) of English monographs showed that Notre Dame patrons check out records that include a summary field at higher frequencies, even when adjusted for the estimated popularity of the monograph. When special collections are brought into the library catalog, existing summaries are often not available to draw on. In 2019, when the University of Notre Dame brought in a special collection of over 5500 Catholic Pamphlets, our department did not expend the hundreds of hours (or more!) of staff time it would take to read the 5500 texts, let alone write summaries of them. Natural Language Processing (NLP) Summarization methods that utilize machine

