Abstract
Using the re-emergence of the /h/ onset from Early Modern to Present-Day English as a case study, we illustrate the making and the functions of a purpose-built web application named (an:a)-lyzer for the interactive visualization of the raw n-gram data provided by Google Books Ngrams (GBN). The database has been compiled from the full text of over 4.5 million books in English, totalling over 468 billion words and covering roughly five centuries. We focus on bigrams consisting of words beginning with graphic <h> preceded by the indefinite article allomorphs a and an, which serve as a diagnostic of the consonantal strength of the initial /h/. The sheer size of this database makes it possible to attain a maximal diachronic resolution, to distinguish highly specific groups of <h>-initial lexical items, and even to trace the diffusion of the observed changes across individual lexical units. The functions programmed into the app enable us to explore the data interactively by filtering, selecting and viewing them according to various parameters that were manually annotated into the data frame. We also discuss limitations of the database, of the app and of the explorative data analysis. The app is publicly accessible online at https://osf.io/ht8se/.
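To make the extraction step concrete, here is a minimal sketch (not the authors' actual pipeline) of how bigrams of the form a/an + <h>-word could be filtered from the raw GBN 2-gram files, whose lines follow the published tab-separated layout ngram, year, match_count, volume_count. The file path is hypothetical; the real data are split across many shards.

```python
import csv
import gzip

# Minimal sketch: filter raw GBN 2-gram data for "a/an + <h>-word" bigrams.
# Assumes the tab-separated layout of the published GBN raw files
# (ngram, year, match_count, volume_count); the file name is hypothetical.
def extract_h_bigrams(path="googlebooks-eng-all-2gram-sample.gz"):
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for ngram, year, match_count, volume_count in csv.reader(f, delimiter="\t"):
            words = ngram.lower().split()
            # Keep only bigrams whose first word is an article allomorph
            # and whose second word begins with graphic <h>.
            if len(words) == 2 and words[0] in ("a", "an") and words[1].startswith("h"):
                rows.append({
                    "article": words[0],
                    "word": words[1],
                    "year": int(year),
                    "count": int(match_count),
                    "volumes": int(volume_count),
                })
    return rows
```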
Highlights
With the release of databases such as the Corpus of Global Web-based English (GloWbE; 1.9 billion words; https://www.english-corpora.org/glowbe/), the News on the Web Corpus (NOW; 8.7 billion words and growing; https://www.english-corpora.org/now/; accessed 26 February 2020) and the Intelligent Web-based Corpus (iWeb; 14 billion words; https://www.english-corpora.org/iweb/), linguists have witnessed the latest peak in a trend towards ever larger databases becoming available free of charge for linguistic investigation.
Even in the area of historical English linguistics, which by its very nature has to rely on texts that happen to have survived from former centuries and that are for that reason typically more restricted in scope and quantity, huge databases have been accumulated.
Google Books Ngrams (GBN) represent the text of about 6% of all books ever published from the 1500s to 2008, and over 50% of this text is in English.
Summary
With the release of databases such as the Corpus of Global Web-based English (GloWbE; 1.9 billion words; https://www.english-corpora.org/glowbe/), the News on the Web Corpus (NOW; 8.7 billion words and growing; https://www.english-corpora.org/now/; accessed 26 February 2020) and the Intelligent Web-based Corpus (iWeb; 14 billion words; https://www.english-corpora.org/iweb/), linguists have witnessed the latest peak in a trend towards ever larger databases becoming available free of charge for linguistic investigation. Even in the area of historical English linguistics, which by its very nature has to rely on texts that happen to have survived from former centuries and that are for that reason typically more restricted in scope and quantity, huge databases have been accumulated. Besides issues of lacking metatextual information, what sets these large datasets apart from smaller corpora from a technical point of view is that they usually come with their own specialized software, e.g. the English Corpora interface (https://www.english-corpora.org/) maintained by Mark Davies, or the Google Books Ngram Viewer (https://books.google.com/ngrams). This becomes problematic when the interface does not provide the function(s) necessary for a specific methodological approach.

Though it seems a long way from big and messy collections of centuries-old prints (automatically converted into machine-readable text by modern software) to phonetic distinctions too minute for listeners to perceive (near-mergers), Schlüter shows that the large amount of data allows for an analytical precision for historical data that is comparable to the acoustic measurements of near-mergers obtained in modern phonetics labs (despite being of a completely different nature).

[Table: structure of the annotated data frame, with the columns lemma, variants, year, count_a_year, volumes_a_year, count_an_year, volumes_an_year, onset, j_glide, origin, initial_stress, v_quantity, variety, words, volumes; the example row shows the lemma herb (spelling variant hearbe), annotated as mute <h> in US only and no /j/ glide.]
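The column names recoverable from the table suggest how the app's central quantity, the share of an before a given <h>-word, can be derived. The following pandas sketch illustrates one way to compute and filter it, assuming a data frame with the columns listed above; the rows and counts are invented for demonstration and are not the authors' data.

```python
import pandas as pd

# Illustrative rows mimicking the annotated data frame described above;
# the counts are invented for demonstration only.
df = pd.DataFrame({
    "lemma":         ["herb", "herb", "humble", "humble"],
    "year":          [1700, 1900, 1700, 1900],
    "count_a_year":  [12, 310, 5, 220],
    "count_an_year": [95, 40, 130, 15],
    "onset":         ["mute <h> in US only"] * 2 + ["mute <h>"] * 2,
    "j_glide":       ["no /j/"] * 4,
})

# Share of an-tokens per lemma and year: high values indicate a weak
# (vowel-like) onset, low values a consonantal /h/.
df["an_share"] = df["count_an_year"] / (df["count_a_year"] + df["count_an_year"])

# Filtering by an annotated parameter, as the app does interactively.
print(df[df["j_glide"] == "no /j/"][["lemma", "year", "an_share"]])
```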