Abstract
In the last 20 years, topic modeling, and the latent Dirichlet allocation (LDA) model in particular, has become one of the most commonly used techniques for exploratory analysis and information retrieval from textual sources. Although topic modeling has been used in a large number of research projects, the technology has not yet become a standard feature of the digital historical collections curated by libraries, archives and other memory institutions. Moreover, many common and well-researched natural language processing techniques, including topic modeling, have not been sufficiently applied to sources in small or low-resource languages, including Latvian. The paper reports the results of the first case study in which the LDA methodology has been used to analyze a data set of historical newspapers in Latvian. The analysis is conducted on the corpus of the newspaper Latvian Soldier and focuses, as an example, on the performance of the topics related to Oskars Kalpaks, the first commander of the Latvian army. In digital humanities research, the results of topic modeling have been used and interpreted in several distinct ways depending on the type and genre of the text, e.g., to acquire semantically coherent, trustworthy lists of keywords, or to extract lexical features that do not aid thematic analysis but instead provide other insights into language usage. The authors propose the applications that could be most suitable for the analysis of historical newspapers in the large digital collections of memory institutions, and recount the challenges of working with textual sources that contain optical character recognition (OCR) errors, problematic segmentation of articles and other issues pertaining to digitized non-contemporary data.