Abstract

We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word’s length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.
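As a rough illustration of the terms used in the abstract, the sketch below shows one way to identify the leader of a synset, check for a change of leadership over a fifty-year horizon, and build features from word length, characters, and historical frequencies. It is a minimal sketch and not the paper's implementation: the function names (`leader`, `leadership_change`, `simple_features`), the data layout, and the frequency values are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: word -> {year -> relative frequency}.
# The numbers below are invented for illustration, not taken from the corpus.
Frequencies = Dict[str, Dict[int, float]]


def leader(synset: Frequencies, year: int) -> str:
    """Return the word with the highest frequency in the given year.

    Ties are broken by insertion order, which is an arbitrary choice.
    """
    return max(synset, key=lambda word: synset[word].get(year, 0.0))


def leadership_change(
    synset: Frequencies, start: int, horizon: int = 50
) -> Tuple[bool, str]:
    """Report whether the leader differs after `horizon` years, and who leads then."""
    old = leader(synset, start)
    new = leader(synset, start + horizon)
    return old != new, new


def simple_features(word: str, synset: Frequencies, years: List[int]) -> dict:
    """Assumed feature set: word length, suffix indicators, past frequencies."""
    return {
        "length": len(word),
        "ends_with_ise": word.endswith("ise"),
        "ends_with_ize": word.endswith("ize"),
        "history": [synset[word].get(y, 0.0) for y in years],
    }


example: Frequencies = {
    "realise": {1900: 3.0e-6, 1950: 2.0e-6},
    "realize": {1900: 2.5e-6, 1950: 4.0e-6},
}

changed, new_leader = leadership_change(example, 1900)
features = simple_features("realize", example, [1900, 1950])
```

In a supervised setting, feature dictionaries like these would be computed from data up to the start year, with the leader at the later year serving as the label.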

Highlights

  • Words are a basic unit for the expression of meanings, but the mapping between words and meanings is many-to-many

  • We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus

  • The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves


Summary

Introduction

Words are a basic unit for the expression of meanings, but the mapping between words and meanings is many-to-many. When we select a word from a set of synonyms to convey a meaning, we prefer one word over the others, and one sense of a polysemous word is generally more likely than its other senses. The Google Books Ngram Corpus (GBNC) provides information about how word frequencies change over time, and WordNet allows us to relate words to their meanings. According to WordNet, ecstatic, enraptured, rapt, rapturous, and rhapsodic all belong to the same synset when they are tagged as adjectives (enraptured could also be the past tense of the verb enrapture). They all mean “feeling great rapture or delight.”

Brandon [8] states that the following three components are crucial to evolution by natural selection: 1. Variation: There is (significant) variation in morphological, physiological and behavioural traits among members of a species.

Differential Fitness
Related work on the evolution of words
Experiments with modeling change
Experiments with NBCP
Experiments with varying time periods
Future work and limitations
Findings
Conclusion