Abstract

The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail.
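The quantities named in the abstract (text beginner and ender counts, bigram transition probabilities, and a log-likelihood measure of pair association) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the sign IDs and the toy corpus are invented and are not drawn from EBUDS or M77, the function names are hypothetical, and the association score shown is the standard G² log-likelihood ratio on a 2x2 contingency table, which may differ from the exact measure used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each text is a sequence of sign IDs in reading order.
# These IDs are invented for illustration; they are not taken from EBUDS or M77.
texts = [
    [1, 2, 3],
    [1, 2, 4],
    [5, 2, 3],
    [1, 6, 3],
    [1, 2, 3],
]


def bigram_counts(texts):
    """Count adjacent sign pairs across all texts."""
    counts = Counter()
    for t in texts:
        counts.update(zip(t, t[1:]))
    return counts


def conditional_prob(bigrams, a, b):
    """Maximum-likelihood estimate of P(b | a) for a first-order Markov chain."""
    total = sum(c for (x, _), c in bigrams.items() if x == a)
    return bigrams[(a, b)] / total if total else 0.0


def g2(bigrams, a, b):
    """G^2 log-likelihood ratio for the 2x2 contingency table of the pair (a, b)."""
    n = sum(bigrams.values())
    k11 = bigrams[(a, b)]                                              # a followed by b
    k12 = sum(c for (x, y), c in bigrams.items() if x == a and y != b)  # a, then not b
    k21 = sum(c for (x, y), c in bigrams.items() if x != a and y == b)  # not a, then b
    k22 = n - k11 - k12 - k21                                          # neither
    observed = [[k11, k12], [k21, k22]]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


# Signs that open and close texts; unequal distributions hint at positional structure.
beginners = Counter(t[0] for t in texts)
enders = Counter(t[-1] for t in texts)
bigrams = bigram_counts(texts)

print("beginners:", dict(beginners))
print("enders:", dict(enders))
print("P(2 | 1) =", conditional_prob(bigrams, 1, 2))
print("G^2(1, 2) =", g2(bigrams, 1, 2))
```

Ranking pairs by `g2` rather than by raw count is one way to see why a highly frequent pair need not be highly significant: a pair built from two very common signs can occur often by chance alone. A model fitted to a real corpus would also need smoothing for unseen n-grams, which this sketch omits.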

Highlights

  • We first present the results of an empirical statistical analysis of the Extended Basic Unique Data Set (EBUDS) corpus

  • EBUDS is a filtered corpus created from M77 to remove duplicates and ambiguities

Introduction

The earliest urban civilization of the Indian subcontinent flourished in the valley of the river Indus and its surroundings during the Bronze Age. At its peak, in the period between 2600 BCE and 1900 BCE [1], it covered approximately a million square kilometers [2], making it the largest urban civilization of the ancient world. The Indus people used a script, which has mainly survived on seals (see Fig. 1 for an example), pottery, and other artifacts made of durable materials such as stone, terracotta and copper. The script usually occurs in short texts, with no more than 14 signs in a single line of text. Obstacles to the decipherment of the sign system include the paucity of long texts, the absence of bilingual texts, and the lack of any definite knowledge of the underlying language(s) the script may have expressed.
