Abstract

Long-range correlation, a property of time series exhibiting relevant statistical dependence between two distant subsequences, is mainly studied in the statistical physics domain and has been reported to exist in natural language. By using a state-of-the-art method for such analysis, long-range correlation is first shown to occur in long CHILDES data sets. To understand why, generative stochastic models of language, originally proposed in the cognitive scientific domain, are investigated. Among representative models, the Simon model is found to exhibit surprisingly good long-range correlation, but not the Pitman-Yor model. Because the Simon model is known not to correctly reflect the vocabulary growth of natural languages, a simple new model is devised as a conjunct of the Simon and Pitman-Yor models, such that long-range correlation holds with a correct vocabulary growth rate. The investigation overall suggests that uniform sampling is one cause of long-range correlation and could thus have some relation with actual linguistic processes.
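As a concrete illustration of the generative processes compared above, the following is a minimal sketch of the Simon model, assuming a sequence of integer word ids; the parameter value and names are illustrative, not taken from the paper. At each step a brand-new word appears with probability alpha; otherwise the next token is copied from a position sampled uniformly from the sequence so far, which is exactly the uniform-sampling mechanism the abstract points to.

    import random

    def simon_model(n, alpha=0.1, seed=0):
        """Simon (1955) process: with probability alpha emit a new word;
        otherwise copy a token drawn uniformly from the history, so the
        chance of reusing a word is proportional to its frequency."""
        rng = random.Random(seed)
        seq = [0]        # start from a single initial word
        next_id = 1
        while len(seq) < n:
            if rng.random() < alpha:
                seq.append(next_id)          # introduce a new vocabulary item
                next_id += 1
            else:
                seq.append(rng.choice(seq))  # uniform sample from the history
        return seq

Under this process the vocabulary grows linearly in the text length (roughly alpha * n after n tokens), rather than with the sublinear power-law growth observed in real text; this is the vocabulary-growth defect of the Simon model that the conjunct model described in the abstract is designed to repair.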

Highlights

  • State-of-the-art generative mathematical models of language include the Simon and Pitman-Yor models and their extensions (Pitman, 2006; Chater and Oaksford, 2008; Lee and Wagenmakers, 2014)

  • Using the method introduced in the previous section, this section examines a kind of data that has never been considered in the context of long-range correlation: the CHILDES corpus of child-directed speech (MacWhinney, 2000)

  • Because long-range correlation is due to the arrangement of frequent words and rare words, […]

    [FIGURE 10 | Rank-frequency distribution, type-token relation, and autocorrelation function for a sequence of one million elements generated by a Pitman-Yor model with a = 0.68 and b = 0.80; a sampler sketch for such a sequence follows this list]
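To accompany the Figure 10 caption above, here is a minimal sketch of a Pitman-Yor sampler in its Chinese-restaurant representation, using the discount a = 0.68 and strength b = 0.80 quoted in the caption; the function name and the simple O(n*K) scan over word counts are illustrative choices, not the paper's implementation.

    import random

    def pitman_yor_sequence(n, a=0.68, b=0.80, seed=0):
        """Sample n word ids from a Pitman-Yor process (Chinese-restaurant
        representation).  With N tokens so far, K distinct words, and c_k
        tokens of word k:  P(new word) = (b + K*a) / (N + b)  and
        P(word k) = (c_k - a) / (N + b)."""
        rng = random.Random(seed)
        counts, seq = [], []
        for _ in range(n):
            N, K = len(seq), len(counts)
            u = rng.uniform(0, N + b)
            if u < b + K * a:
                counts.append(1)          # open a new "table": a new word
                seq.append(K)
            else:
                u -= b + K * a
                for k, c in enumerate(counts):
                    u -= c - a            # the (c_k - a) weights sum to N - K*a
                    if u < 0:
                        counts[k] += 1    # reuse existing word k
                        seq.append(k)
                        break
        return seq

With a positive discount a, the number of distinct words in such a sequence grows as a power law of the length with exponent approximately a, which is how the Pitman-Yor model reproduces Heaps' law.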


Summary

INTRODUCTION

State-of-the-art generative mathematical models of language include the Simon and Pitman-Yor models and their extensions (Pitman, 2006; Chater and Oaksford, 2008; Lee and Wagenmakers, 2014). Heaps' law describes how vocabulary growth forms a power law with respect to the total text size (Guiraud, 1954; Herdan, 1964; Heaps, 1978), and the Pitman-Yor model follows this law well. In this paper, another power law, underlying the autocorrelation function of natural language, is considered. The universality of long-range correlation in language is an ongoing topic of study, and the phenomenon has been reported across domains; in linguistics, for instance, it has been shown through hand counting how rare words cluster in the Iliad (van Emde Boas, 2004). The article discusses the relation between uniform sampling and actual linguistic processes.
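The two power laws discussed here can be made concrete with a short sketch, assuming a sequence of word ids as input. The log-log least-squares fit for the Heaps exponent and the plain sample autocorrelation below are generic textbook formulations, not the specific estimation method of the paper; in particular, for word sequences the autocorrelation is often computed over a derived numeric series, such as the intervals between successive occurrences of rare words.

    import numpy as np

    def heaps_exponent(seq, n_points=20):
        """Fit Heaps' law v(n) ~ n**beta, where v(n) is the number of
        distinct words among the first n tokens, by least squares in
        log-log coordinates."""
        seen, growth = set(), []
        for w in seq:
            seen.add(w)
            growth.append(len(seen))      # growth[i] = types in first i+1 tokens
        idx = np.unique(np.geomspace(1, len(seq) - 1, n_points).astype(int))
        beta, _ = np.polyfit(np.log(idx + 1), np.log(np.take(growth, idx)), 1)
        return beta

    def autocorrelation(x, max_lag):
        """Sample autocorrelation c(s) for lags s = 1..max_lag.  Long-range
        correlation shows up as a power-law decay c(s) ~ s**(-gamma)."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        var = x.var()
        return [float(np.mean(x[:-s] * x[s:]) / var) for s in range(1, max_lag + 1)]

For example, applying heaps_exponent to a sequence drawn from the Pitman-Yor sampler sketched above should return a value close to its discount parameter, while a power-law fit to the autocorrelation values gives the correlation exponent.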

QUANTIFICATION OF LONG-RANGE CORRELATION
AUTOCORRELATION FUNCTIONS FOR GENERATIVE LANGUAGE MODELS
DISCUSSION
CONCLUSION