Abstract

While statistical physics methods for analyzing large corpora have been useful in unveiling many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., one written in an unknown alphabet) is compatible with a natural language and, if so, to which language it could belong. The approach is based on three types of statistical measurements: those obtained from first-order statistics of word properties in a text, those derived from the topology of complex networks representing texts, and those based on intermittency concepts in which the text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependence of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include network assortativity and the degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript, which could be helpful in the effort to decipher it. Because we were able to identify statistical measurements that depend more on syntax than on semantics, the framework may also serve for text analysis in language-dependent applications.
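
As a concrete illustration of the comparison with shuffled texts described above, the sketch below computes one such measurement, the intermittency (burstiness) of individual words treated as a time series of their positions, for a real text and for a shuffled version of the same tokens. This is our own minimal Python sketch, not the authors' code; the file name book.txt and the example words are placeholders.

```python
# Minimal sketch (our own, not from the paper): word intermittency in a real
# text versus a shuffled version of the same tokens.
import random
import re
import statistics

def tokens(text):
    """Lowercase word tokens in reading order."""
    return re.findall(r"[a-z']+", text.lower())

def intermittency(token_list, word):
    """Coefficient of variation of the gaps between successive occurrences of
    `word`; values well above 1 indicate bursty (intermittent) usage, while a
    shuffled text tends to give values close to 1."""
    positions = [i for i, t in enumerate(token_list) if t == word]
    if len(positions) < 3:
        return None  # too rare to estimate
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

text = open("book.txt", encoding="utf-8").read()  # placeholder: any plain-text book
toks = tokens(text)
shuffled = toks[:]
random.shuffle(shuffled)  # destroys word ordering but keeps word frequencies

for w in ("god", "water"):  # placeholder example words
    print(w, intermittency(toks, w), intermittency(shuffled, w))
```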

Highlights

  • Methods from statistics, statistical physics, and artificial intelligence have increasingly been used to analyze large volumes of text for a variety of applications [1,2,3,4,5,6,7,8,9,10,11], some of which are related to fundamental linguistic and cultural phenomena.

  • The methods for text analysis we consider can be classified into three broad classes: (i) those based on first-order statistics, where data on classes of words, e.g. word frequencies, are used in the analysis [18]; (ii) those based on metrics of networks representing texts [3,4,8,9,19], in which adjacent words are directionally connected according to the natural reading order; (iii) those using intermittency concepts and time-series analysis for texts [4,5,6,7,20,21,22,23]; a sketch of the network construction in (ii) is given after this list.

  • We report the statistical analysis of the different measurements across texts and languages.
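
The following minimal sketch illustrates the network-based class of measurements in item (ii): a directed word-adjacency network is built with networkx, and degree, strength, selectivity (strength divided by degree) and degree assortativity are read from it. The code and its identifiers are ours, offered only as an illustration of how such measurements can be computed; book.txt is a placeholder file name.

```python
# Sketch (our own, not the authors' code) of the network-based measurements.
import re
import networkx as nx

def adjacency_network(text):
    """Directed word-adjacency network: consecutive words in reading order are
    linked, and edge weights count how often each adjacency occurs."""
    toks = re.findall(r"[a-z']+", text.lower())
    g = nx.DiGraph()
    for a, b in zip(toks, toks[1:]):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)
    return g

g = adjacency_network(open("book.txt", encoding="utf-8").read())  # placeholder file

degree = dict(g.degree())                    # number of distinct neighbours (in + out)
strength = dict(g.degree(weight="weight"))   # weighted degree (strength)
selectivity = {w: strength[w] / degree[w] for w in g}  # strength per link
assortativity = nx.degree_assortativity_coefficient(g)

print("assortativity:", assortativity)
print(sorted(selectivity, key=selectivity.get, reverse=True)[:10])  # high-selectivity words
```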



Introduction

Methods from statistics, statistical physics, and artificial intelligence have increasingly been used to analyze large volumes of text for a variety of applications [1,2,3,4,5,6,7,8,9,10,11], some of which are related to fundamental linguistic and cultural phenomena. The obvious disadvantage is the superficial nature of such analyses, for even simple linguistic phenomena, such as the lexical disambiguation of homonymous words, are very hard to treat. Another limitation of these statistical methods is the need to identify features that are representative of the phenomena under investigation, since many parameters can be extracted from the analysis but there is no rule to determine which ones are really informative for the task at hand. Finally, there may be variability among texts of the same language, especially owing to semantic issues.
