Abstract

The Google Books Ngram Viewer (Google Ngram) is a search engine that charts word frequencies from a large corpus of books and thereby allows for the examination of cultural change as it is reflected in books. While the tool’s massive corpus of data (about 8 million books or 6% of all books ever published) has been used in various scientific studies, concerns about the accuracy of results have simultaneously emerged. This paper reviews the literature and serves as a guideline for improving Google Ngram studies by suggesting five methodological procedures suited to increase the reliability of results. In particular, we recommend the use of (I) different language corpora, (II) cross-checks on different corpora from the same language, (III) word inflections, (IV) synonyms, and (V) a standardization procedure that accounts for both the influx of data and unequal weights of word frequencies. Further, we outline how to combine these procedures and address the risk of potential biases arising from censorship and propaganda. As an example of the proposed procedures, we examine the cross-cultural expression of religion via religious terms for the years 1900 to 2000. Special emphasis is placed on the situation during World War II. In line with the strand of literature that emphasizes the decline of collectivistic values, our results suggest an overall decrease of religion’s importance. However, religion regains importance during times of crisis such as World War II. By comparing the results obtained through the different methods, we illustrate that applying and particularly combining our suggested procedures increases the reliability of results and prevents authors from drawing wrong conclusions.
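To make procedures (III) to (V) more concrete, the following minimal Python sketch shows one way they could be combined. It assumes that the influx of data is handled by dividing raw counts by the total number of words per year, and that unequal weights are handled by z-scoring each term before averaging. Both choices are assumptions made for illustration, not necessarily the exact procedure used in this paper. The function names and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def relative_frequencies(counts, totals):
    """Divide yearly raw counts by the yearly corpus size (assumed correction for data influx)."""
    return [count / total for count, total in zip(counts, totals)]


def z_scores(series):
    """Rescale a series to mean 0 and standard deviation 1 (assumed standardization)."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series carries no trend information.
        return [0.0 for _ in series]
    return [(value - mu) / sigma for value in series]


def combined_index(counts_by_term, totals):
    """Average the z-scored relative frequencies of all terms, year by year."""
    standardized = [
        z_scores(relative_frequencies(counts, totals))
        for counts in counts_by_term.values()
    ]
    return [mean(year_values) for year_values in zip(*standardized)]


# Made-up yearly totals and counts for three years (illustration only).
totals = [1_000_000, 2_000_000, 4_000_000]
counts_by_term = {
    "church": [400, 600, 800],    # base form
    "churches": [40, 60, 80],     # inflection, far rarer than the base form
    "prayer": [100, 160, 240],    # related term
}

index = combined_index(counts_by_term, totals)
print(index)
```

In this toy input the raw counts of every term rise over time, yet the combined index falls, because the corpus grows faster than the counts do. Z-scoring before averaging keeps the frequent term "church" from dominating the rarer terms.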

Highlights

  • Since its launch in 2010, the possibilities and limitations of using the Google Books Ngram Viewer (Google Ngram) for research purposes have been controversially discussed

  • Google Ngram allows for hands-on quantification of cultural change using millions of books

  • To the best of our knowledge, this is the first summary of the tool’s limitations that comes with a set of methodological procedures suited to improve the reliability of results


Introduction

Since its launch in 2010, the possibilities and limitations of using the Google Books Ngram Viewer (Google Ngram) for research purposes have been controversially discussed. Because early decades contain significantly fewer books, the overall corpus of Google Ngram becomes sufficiently large for scientific use by the year 1800 [4]. When the tool was released in 2010, the total corpus consisted of more than 5 million books, covering the languages English, French, Spanish, German, Chinese, Russian, and Hebrew. These books were drawn from over 40 different university libraries [4]. There are two fiction corpora, which include predominantly English fiction books, and one corpus, called “English One Million”, which includes a balanced text collection of 6000 English-language books, published between 1500 and 2008, and chosen from any one year
