ABSTRACT With the exponential growth of digitized historical materials, historians and social scientists face the daunting task of navigating vast online collections. Although online archives provide search tools for querying their materials, these tools usually do not meet scholars’ need to trace and store all relevant materials from an archive. This case study shows how computer vision methods, and more specifically a signal processing approach, can be used to identify, classify and extract relevant items from a vast historical data collection. It reports the results of constructing a data pipeline that extracts matrimonial advertisements from the digitized Catalan newspaper La Vanguardia. Matrimonial advertisements can provide genuine insights into the evolution of partner preferences over time, but they are hard to collect because they are scattered across millions of digitized historical newspapers and magazines. Moreover, to study variation in partner preferences by, for example, sex, social class, marital status and time period, the data must be stored in a database. The pipeline extracts matrimonial advertisements in a stepwise fashion: identification, through binarisation and segmentation, followed by classification based on Optical Character Recognition (OCR). A comprehensive qualitative and quantitative evaluation demonstrates the efficacy of the pipeline for extracting matrimonial advertisements. The findings not only underscore the viability of the signal processing approach but also highlight its potential for advancing research in historical demography, family history and economic history, as similar pipelines can be set up to extract other relevant newspaper items, such as marriage, birth, death and moving announcements, job vacancies or business announcements, from digitized source collections.
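To make the stepwise idea concrete, the following is a minimal sketch of the three stages named in the abstract (binarisation, segmentation, OCR-based classification). It is not the pipeline built in the study: the threshold value, the row-wise ink profile used for segmentation, the keyword list and all function names are assumptions chosen for illustration, and the OCR step itself is left out, with classification operating on text that an OCR engine is assumed to have produced already.

```python
# Illustrative sketch only; thresholds, keywords and names are hypothetical.

def binarise(page, threshold=128):
    """Map a grayscale page (rows of 0-255 values) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in page]


def segment_rows(binary, min_ink=1):
    """Split a binarised page into horizontal blocks.

    Treats the per-row ink count as a 1-D signal and returns (start, end)
    row indices (end exclusive) for each run of rows with >= min_ink pixels.
    """
    profile = [sum(row) for row in binary]
    blocks, start = [], None
    for i, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = i                      # a block of text begins
        elif ink < min_ink and start is not None:
            blocks.append((start, i))      # white gap closes the block
            start = None
    if start is not None:                  # block running to the page end
        blocks.append((start, len(profile)))
    return blocks


# Hypothetical cue words; a real classifier would need a vetted vocabulary.
KEYWORDS = ("matrimonio", "casarse")


def is_matrimonial(ocr_text, keywords=KEYWORDS):
    """Flag OCR output (assumed given) that contains any cue word."""
    text = ocr_text.lower()
    return any(k in text for k in keywords)


# Tiny synthetic "page": two inked bands separated by white rows.
page = [
    [255, 255, 255, 255],
    [0, 255, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [0, 0, 0, 0],
]
blocks = segment_rows(binarise(page))
```

On the synthetic page, `blocks` holds one index pair per inked band; each band would then be cropped, passed to an OCR engine, and the resulting text checked with `is_matrimonial` before being stored.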