Summary

La Nature (1873–1962) is a French popular science magazine that covered a long period and a wide range of topics. It is available as OCR-processed archives, which form a corpus that is simultaneously diachronic, heterogeneous, and noisy. Although these characteristics make it complex to analyze, La Nature is of great interest for digital humanities studies on the evolution of thought in science, technology, and even politics. The work presented in this article is part of research on the semantic annotation of these archives, which aims to discover clues for exploring them. One type of clue that has not been explored in a complex corpus such as La Nature is binomial names, or more specifically, the named entities that refer to the Linnaean classification of life, e.g., Escherichia coli. To overcome this complexity, we introduce the concept of a Competent Reader, who can detect binomial names even when they are obsolete, non-standard, or defaced by OCR. Our approach, which we call the Competent Reader Imitator (CRI), imitates such a reader by combining a rule-based approach with a frequency argument. We show that this innovative method is robust to numerous variations and consistently achieves an F-measure of about 70% despite diachronicity, heterogeneity, and noise, which are all known to impede named entity recognition. The Competent Reader imitation approach has many potential applications, such as the study of chemical names and of names of scientific and technical artifacts. Beyond our work on La Nature, we hope this paper provides a set of tools and methods that are easily understandable, frugal, and usable by a general public interested in exploring similar historical corpora.
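The idea of combining a rule with a frequency argument can be illustrated with a minimal Python sketch. This is not the CRI implementation: the abstract does not specify the actual rules or the frequency criterion, so the regular expression `BINOMIAL_RULE`, the function names, and the `min_count` threshold below are all assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical rule (an assumption, not the paper's rule set): a capitalized,
# genus-like token followed by a lowercase, epithet-like token.
BINOMIAL_RULE = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_binomials(text):
    """Return every 'Genus epithet'-shaped pair that the rule matches."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL_RULE.findall(text)]


def frequent_binomials(pages, min_count=2):
    """Keep only candidates seen at least min_count times across all pages.

    The threshold is a stand-in for a frequency argument: pairs that merely
    look like binomials by accident tend to recur less often than real names.
    """
    counts = Counter()
    for page in pages:
        counts.update(candidate_binomials(page))
    return {name: n for name, n in counts.items() if n >= min_count}


pages = [
    "The bacterium Escherichia coli was observed.",
    "Cultures of Escherichia coli grew quickly.",
    "Paris was cold that year.",
]
print(frequent_binomials(pages))
```

In this toy input the rule alone also matches sentence openers such as "The bacterium" and "Paris was"; the frequency filter discards them because each occurs only once. A real corpus would additionally need handling for OCR damage and obsolete spellings, which this sketch does not attempt.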