Unlocking the Digitized Historical Newspaper Archive

Abstract

This paper utilizes historical newspapers through the application of computer vision and machine/deep learning to extract headlines and illustrations from newspapers for storytelling. This endeavor seeks to unlock the historical knowledge embedded within newspaper contents while applying cutting-edge methodological paradigms for research in the digital humanities (DH). We aimed to provide another facet beyond the traditional search or browse interfaces by incorporating these DH tools into place- and time-based visualizations. Experimental results showed that our proposed methodologies, OCR (optical character recognition) with scraping and deep learning object detection models, can extract the textual and image content needed for more sophisticated analysis. Timeline and geodata visualization products were developed to facilitate a comprehensive exploration of our historical newspaper data. The timeline-based tool spans the period from July 1942 to July 1945, enabling users to explore evolving narratives through the lens of daily headlines. The interactive geographical tool enables users to identify geographic hotspots and patterns. Combined, the two products enrich users' understanding of the events and narratives unfolding across time and space.
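The timeline tool described in the abstract organizes extracted headlines by publication date over July 1942 to July 1945. A minimal sketch of that grouping step, assuming headlines have already been extracted by OCR; the records, values, and function name here are illustrative, not the paper's code:

```python
from collections import defaultdict
from datetime import date

def build_timeline(records, start=date(1942, 7, 1), end=date(1945, 7, 31)):
    """Group extracted headlines by publication date, keeping only
    those inside the covered period (July 1942 - July 1945)."""
    timeline = defaultdict(list)
    for day, headline in records:
        if start <= day <= end:
            timeline[day].append(headline)
    return dict(sorted(timeline.items()))

records = [
    (date(1942, 7, 4), "Hypothetical headline A"),
    (date(1944, 6, 6), "Hypothetical headline B"),
    (date(1946, 1, 1), "Outside the covered period"),
]
timeline = build_timeline(records)
```

Only the two in-range dates survive; a date-indexed mapping like this is enough to drive a slider- or scroll-based timeline view.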

Similar Papers
  • Single Report
  • Citations: 3
  • 10.21236/ada458699
Full-Text Access to Historical Newspapers
  • Apr 1, 1999
  • Tapas Kanungo + 1 more

Newspapers are rich records of U.S. history. Due to the deterioration of older newspapers, the National Endowment for the Humanities is archiving 19th-century newspapers on microfilm. Although microfilm is a good preservation method, it provides limited access to researchers and the general public. We are building a system to provide universal access to digital images and full-text content of historical newspapers. The system has three main components: (a) an Optical Character Recognition (OCR) module that converts digitized images into searchable text and identifies regions; (b) an Information Retrieval module that applies linguistic information to aid in segmentation, indexing, and retrieval of the noisy OCR'd text; and (c) a User Interface module that allows historians and educators to query and view retrieved documents. Thus far, we have developed two OCR techniques targeted at processing historical newspapers, and we have built a user interface to search the OCR output and superimpose matches on a page image from the newspaper.

  • Research Article
  • 10.5325/edgallpoerev.17.1.84
Poe in Cyberspace: Have Poe Websites Become an Endangered Species?
  • Apr 1, 2016
  • The Edgar Allan Poe Review
  • Heyward Ehrlich


  • Book Chapter
  • Citations: 21
  • 10.1007/978-3-540-28640-0_11
Segmentation of Handwritten Characters for Digitalizing Korean Historical Documents
  • Jan 1, 2004
  • Min Soo Kim + 3 more

Historical documents are valuable cultural heritage and sources for the study of history, social conditions, and daily life at the time. The digitalization of historical documents aims to provide instant access to the archives for researchers and the public, who previously had only limited access for preservation reasons. However, most of these documents are not only handwritten in ancient Chinese characters but also have complex page layouts. As a result, it is not easy to apply a conventional OCR (optical character recognition) system to historical documents, even though OCR has received the most attention for several years as a key module in digitalization. We have been developing an OCR-based digitalization system for historical documents for years. In this paper, we propose dedicated segmentation and rejection methods for OCR of Korean historical documents. The proposed recognition-based segmentation method uses geometric features and context information with the Viterbi algorithm. The rejection method uses the Mahalanobis distance and posterior probability, especially for solving the out-of-class problem. Some promising experimental results are reported.

  • Single Report
  • 10.5281/zenodo.6602429
Optical character recognition quality affects perceived usefulness of historical newspaper clippings
  • Jun 1, 2022
  • arXiv (Cornell University)
  • Kimmo Kettunen + 4 more

Introduction. We study the effect of differing optical character recognition quality in interactive information retrieval with a collection of one digitized historical Finnish newspaper. Method. This study is based on the simulated interactive information retrieval work-task model. Thirty-two users searched an article collection of the Finnish newspaper Uusi Suometar 1869-1918, with ca. 1.45 million automatically segmented articles. Our article search database had two versions of each article with different optical character recognition quality. Each user performed six pre-formulated and six self-formulated short queries and subjectively evaluated the top-10 results using a graded relevance scale of 0-3, without knowing about the optical character recognition quality differences of the otherwise identical articles. Analysis. User evaluations were analysed by comparing mean evaluation scores across user sessions. Differences in query results were detected by analysing the lengths of returned articles in pre-formulated and self-formulated queries and the number of distinct documents retrieved overall in these two sessions. Results. The main result of the study is that improved optical character recognition quality positively affects the perceived usefulness of historical newspaper articles. Conclusions. We were able to show that improvement in the optical character recognition quality of documents leads to higher mean relevance evaluation scores of query results in our historical newspaper collection. To the best of our knowledge, this simulated interactive user task is the first to show empirically that users' subjective relevance assessments are affected by a change in the quality of optically read text.
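The analysis step above reduces to comparing mean graded relevance scores between the two OCR versions of the collection. A toy illustration, with entirely hypothetical judgement scores (the study's actual data is not reproduced here):

```python
def mean_relevance(scores):
    """Mean of graded relevance judgements (0-3 scale) for one session."""
    return sum(scores) / len(scores)

# Hypothetical top-10 judgements for the same queries run against the
# lower- and higher-quality OCR versions of the otherwise identical articles.
old_ocr = [1, 0, 2, 1, 0, 1, 2, 0, 1, 1]
new_ocr = [2, 1, 2, 1, 1, 2, 2, 1, 1, 2]

improvement = mean_relevance(new_ocr) - mean_relevance(old_ocr)
```

A positive `improvement` is the pattern the study reports: better OCR quality yields higher mean relevance assessments.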

  • Research Article
  • Citations: 2
  • 10.1108/ajim-05-2023-0180
Unearthing historical insights: semantic organization and application of historical newspapers from a fine-grained knowledge element perspective
  • Nov 14, 2023
  • Aslib Journal of Information Management
  • Shaodan Sun + 2 more

Purpose: This paper aims to amplify the retrieval and utilization of historical newspapers through the application of semantic organization, all from the vantage point of a fine-grained knowledge element perspective. This endeavor seeks to unlock the latent value embedded within newspaper contents while simultaneously furnishing invaluable guidance within methodological paradigms for research in the humanities domain.
Design/methodology/approach: According to the semantic organization process and the knowledge element concept, this study proposes a holistic framework comprising four pivotal stages: knowledge element description, extraction, association and application. Initially, a semantic description model dedicated to knowledge elements is devised. Subsequently, harnessing advanced deep learning techniques, the study delves into entity recognition and relationship extraction. These techniques are instrumental in identifying entities within the historical newspaper contents and capturing the interdependencies among them. Finally, an online platform based on Flask is developed to enable the recognition of entities and relationships within historical newspapers.
Findings: This article utilized the Shengjing Times·Changchun Compilation as the dataset for describing, extracting, associating and applying newspaper contents. Regarding knowledge element extraction, BERT + BS consistently outperforms Bi-LSTM, CRF++ and even BERT in terms of Recall and F1 scores, making it a favorable choice for entity recognition in this context. Particularly noteworthy is the Bi-LSTM-Pro model, which stands out with the highest scores across all metrics, notably achieving an exceptional F1 score in knowledge element relationship recognition.
Originality/value: Historical newspapers transcend their status as mere artifacts, evolving into invaluable reservoirs safeguarding societal and historical memory. Through semantic organization from a fine-grained knowledge element perspective, they can support semantic retrieval, semantic association, information visualization and knowledge discovery services for historical newspapers. In practice, this can empower researchers to unearth profound insights within the historical and cultural context, broadening the landscape of digital humanities research and practical applications.

  • Book Chapter
  • 10.1007/978-3-030-66519-7_3
Deep Learning for Character Recognition
  • Jan 1, 2021
  • B R Kavitha + 2 more

Advances in technology have brought the entire world to hand through the Internet and mobile phones. As computers and cameras have come into pockets as smartphones, many vision-related applications have become easier. What a human visualizes can now be visualized by machines as well, through advancements in machine vision algorithms. This chapter gives a detailed study of one such computer vision application, character recognition, with the aid of deep learning techniques. In particular, this chapter covers the implementation of widely used convolutional neural networks in offline character recognition, which is used in document analysis, document recognition, scene text classification, localization, and recognition.
Keywords: Character recognition, Deep learning, CNN, Tamil character recognition

  • Research Article
  • Citations: 15
  • 10.3390/app10165430
Automatic CNN-Based Arabic Numeral Spotting and Handwritten Digit Recognition by Using Deep Transfer Learning in Ottoman Population Registers
  • Aug 6, 2020
  • Applied Sciences
  • Yekta Said Can + 1 more

Historical manuscripts and archival documentation are handwritten texts that are the backbone sources for historical inquiry. Recent developments in the digital humanities field and the need to extract information from historical documents have accelerated digitization processes. Cutting-edge machine learning methods are applied to extract meaning from these documents. Page segmentation (layout analysis), keyword, number and symbol spotting, and handwritten text recognition algorithms are tested on historical documents. For most languages, these techniques are widely studied and high-performance methods have been developed. However, the properties of Arabic scripts (i.e., diacritics, varying script styles, and ligatures) create additional problems for these algorithms, and research is therefore limited. In this research, we first automatically spotted the Arabic numerals from the very first series of population registers of the Ottoman Empire, conducted in the mid-nineteenth century, and recognized these numbers. They are important because they hold information about the number of households, registered individuals and the ages of individuals. We applied a red color filter to separate numerals from the document, taking advantage of the structure of the studied registers (numerals are written in red). We first used a CNN-based segmentation method for spotting these numerals. In the second part, we annotated a local Arabic handwritten digit dataset from the spotted numerals by selecting uni-digit ones and tested the deep transfer learning method from large open Arabic handwritten digit datasets for digit recognition. We achieved promising results for recognizing digits in these historical documents.
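The red color filter mentioned above can be sketched as a simple per-channel threshold: keep pixels whose red channel is high while green and blue are low. The threshold values below are hypothetical placeholders, not those used in the paper:

```python
import numpy as np

def red_mask(rgb, r_min=150, gb_max=100):
    """Boolean mask of pixels that are predominantly red.
    Thresholds are illustrative assumptions, not the paper's values."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r >= r_min) & (g <= gb_max) & (b <= gb_max)

# Toy 2x2 "page": one red pixel (a numeral stroke), three background pixels.
page = np.array([[[200,  40,  30], [240, 235, 220]],
                 [[245, 240, 230], [ 60,  50,  40]]], dtype=np.uint8)
mask = red_mask(page)
```

The resulting mask isolates candidate numeral pixels, which can then be passed to the segmentation stage.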

  • Conference Article
  • Citations: 3
  • 10.1109/icpr.2016.7900156
A novel text structure feature extractor for Chinese scene text detection and recognition
  • Dec 1, 2016
  • Xiaohang Ren + 5 more

Scene text information extraction plays an important role in many computer vision applications. Unlike most existing text extraction algorithms for English texts, in this paper, we focus on Chinese texts, which are more complex in stroke and structure. To tackle this challenging problem, we propose a novel convolutional neural network (CNN) based text structure feature extractor for Chinese texts. Each Chinese character contains its specific types and combination of text structure components, which is rarely seen in backgrounds. Thus, different from the features only applicable to one text extraction stage (text detection or text recognition), the text structure component feature is suitable for both Chinese text detection and recognition. A text structure component detector (TSCD) layer is designed to detect the large amount of component types, which is the most challenging part of extracting text structure component features. Through statistical classification various types of text structure component are detected by their specially designed convolutional units in the TSCD layer. With the TSCD layer, the CNN has improvements in the accuracy and uniqueness of text feature description. In the evaluation, both text detection and recognition algorithms based on the proposed text structure feature extractor achieve state-of-the-art results in two datasets.

  • Research Article
  • Citations: 26
  • 10.1016/j.neucom.2023.126702
A survey of text detection and recognition algorithms based on deep learning technology
  • Aug 18, 2023
  • Neurocomputing
  • Xiao-Feng Wang + 5 more


  • Book Chapter
  • Citations: 4
  • 10.1007/978-3-030-34058-2_31
Improving OCR for Historical Documents by Modeling Image Distortion
  • Jan 1, 2019
  • Keiya Maekawa + 4 more

Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
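The training-data trick described above, pairing a clean image with a synthetically degraded copy, can be sketched as follows. The noise model and its parameters are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pair(clean, noise_std=0.1, speckle_prob=0.02):
    """Return (noisy, clean): additive Gaussian noise plus random dark
    speckles, imitating the deterioration of printed pages. Parameters
    are illustrative, not taken from the paper."""
    noisy = clean + rng.normal(0.0, noise_std, clean.shape)
    speckles = rng.random(clean.shape) < speckle_prob
    noisy[speckles] = 0.0            # dark spots, like dirt or ink bleed
    return np.clip(noisy, 0.0, 1.0), clean

clean = np.ones((32, 32))            # a blank white patch, values in [0, 1]
noisy, target = make_training_pair(clean)
```

Pairs like `(noisy, target)` are exactly what an image-to-image network such as U-Net needs: the noisy image as input and the clean image as the regression target.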

  • Research Article
  • Citations: 16
  • 10.1609/aaai.v29i1.9487
Automatic Assessment of OCR Quality in Historical Documents
  • Feb 18, 2015
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Anshul Gupta + 7 more

Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. This paper presents an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-base classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating local information to each BB, its spatial location, shape and size. When evaluated on a dataset containing over 72,000 manually-labeled BBs from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions in 85% of the cases.
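The rule-based first pass that seeds the iterative classifier can be illustrated with a purely geometric check on each bounding box. The thresholds and rules below are hypothetical stand-ins for the paper's actual rule base:

```python
def initial_label(box, page_height):
    """Rule-based first pass: label a bounding box (w, h in pixels) as
    'text' or 'noise' from its geometry alone. Thresholds are
    illustrative, not those of the paper."""
    w, h = box
    if h < 4 or h > 0.2 * page_height:   # tiny specks or huge smears
        return "noise"
    aspect = w / h
    if aspect < 0.1 or aspect > 25:      # implausibly thin or long box
        return "noise"
    return "text"

boxes = [(40, 12), (2, 2), (900, 3)]     # word-sized, speck, border smear
labels = [initial_label(b, page_height=1000) for b in boxes]
```

These initial labels would then be refined iteratively using each box's neighborhood, as the abstract describes.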

  • Research Article
  • Citations: 45
  • 10.1007/s10462-020-09930-6
Deep learning approaches to scene text detection: a comprehensive review
  • Jan 1, 2021
  • Artificial Intelligence Review
  • Tauseef Khan + 2 more

In recent times, text detection in the wild has improved significantly due to the tremendous success of deep learning models. Applications of computer vision have emerged and been reshaped in this booming era of deep learning. In the last decade, the research community has witnessed drastic changes in the area of text detection from natural scene images in terms of approach, coverage and performance, owing to the huge advancement of deep neural network based models. In this paper, we present (1) a comprehensive review of deep learning approaches to scene text detection, (2) suitable deep frameworks for this task, followed by critical analysis, (3) a categorical study of publicly available scene image datasets and applicable standard evaluation protocols with their pros and cons, and (4) comparative results and analysis of reported methods. Moreover, based on this review and analysis, we point out possible future scopes and thrust areas of deep learning approaches to text detection from natural scene images on which upcoming researchers may focus.

  • Research Article
  • Citations: 1
  • 10.18622/kher.2013.03.125.183
High School Student’s Understanding History by the Strategy of Writing Historical Document
  • Mar 31, 2013
  • The Korean History Education Review
  • Han-Jong Kim

The purpose of this study is to investigate the effect of the strategy of writing historical documents on high school students' understanding of history. To achieve this purpose, I researched students' understanding of 'Daehan Empire and Gwangmu Reformation' and 'the Crusades' according to the strategy of writing historical documents. Three types of writing historical documents in text are distinguished: in text A, the results of interpreting the historical document are merged into the body content; in text B, original sentences of historical documents are written between body contents; and in text C, historical documents are presented as learning materials separate from the body content. The research subjects of this study are two high school students in Chungcheongbuk-do. The results of the research are as follows. First, students reading text A tend to uncritically accept the interpretation of the historical document merged into the body content. Students are inclined not to attend to the writer's view involved in the text. Second, text B drew students' attention to the document's sentences interposed between body contents, which are directly connected to the students' understanding of history. Writing the document's sentences as quotations between the body contents is effective for emphasizing an aspect of historical fact. Third, text C is useful for presenting an important historical fact and for inquiring into historical issues. Students can interpret the historical document from a pluralistic viewpoint and understand the nature of historical facts through reading a text of type C.

  • Research Article
  • Citations: 1
  • 10.1785/gssrl.71.5.553
The New IASPEI Subcommittee on Historical Instruments and Documents in Seismology: Goals, Objectives and First Results
  • Sep 1, 2000
  • Seismological Research Letters
  • G Ferrari

At the beginning of 1999, within the IASPEI Committee on Education, a subcommittee was created with the goal to promote (1) research, inventory, and recovery of historical instruments, recordings, station bulletins, papers, and scientific correspondence, (2) preservation and reproduction of seismograms and historical documents, especially scanning/digitizing into computer files, and (3) experimentation with techniques for the scientific investigation of all historical seismic data. The subcommittee draws the inspiration for its own purposes and goals from the recent experiences gained in Italy through the TROMOS project, carried out by SGA for the Istituto Nazionale di Geofisica, and in Europe through the...

  • Conference Article
  • Citations: 3
  • 10.5167/uzh-197209
How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR
  • Jun 2, 2020
  • Phillip Ströbel + 2 more

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate text recognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle when choosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand. In this work, we present our approach to building a ground truth for a historical German-language newspaper published in black letter. We also report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truth to make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experiments show that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices to achieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means that additional manual correction for diverging data is superfluous.
