Abstract

In this paper we introduce an extended version of the Vedic Treebank (VTB, Hellwig et al. 2020), which comes with revisited and extended annotation guidelines. To assess the quality of our annotations as well as the usability and limits of the guidelines, we performed an inter-annotator agreement test. The results show that agreement between annotators is hampered by various factors, most prominently by insufficient understanding of the content, owing to the cultural and temporal gap, and by incomplete knowledge of Vedic grammar. An in-depth discussion of disagreeing annotations demonstrates that the setup of the workflow, too, has a major influence on inter-annotator agreement. We suggest some measures that can help increase transparency and annotation consistency, according to current knowledge of the language, when annotating Vedic Sanskrit or ancient language varieties in general.

Highlights

  • Treebanks have become indispensable tools for studying syntactic and morphological phenomena and for enhancing Natural Language Processing. While earlier endeavors in annotating syntactic structure were largely confined to modern languages, an increasing number of treebanks of ancient languages have been published in recent years.

  • Our paper follows in the wake of other contributions concerned with the process of building linguistic resources for ancient languages, such as the PROIEL treebanks of early Indo-European languages (Eckhoff et al. 2018a,b), the ITTB (Passarotti 2019), the Ancient Greek and Latin Dependency Treebank or, outside of the Indo-European domain, the treebank of Old Chinese (Yasuoka 2019), and with the potential that annotated corpora have for the study of ancient languages (Eckhoff et al. 2018b).

  • As sentence segmentation turned out to be a source of considerable disagreement, we report a third setting, ‘cleaned-sameSeg’, for the evaluation of the actual syntactic annotation (5.2).
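The idea behind a same-segmentation setting can be illustrated with a minimal sketch: compare two annotators only on sentences that both segmented identically, then count how often they agree on a token's head (unlabeled agreement) and on head plus label (labeled agreement). This excerpt does not state which metric the paper actually uses, so the function name `agreement_same_segmentation`, the data layout, and the toy sentences below are all assumptions made for illustration; only the labels `amod` and `acl:dpct` are taken from the text.

```python
from typing import Dict, List, Tuple

# One token's annotation: (head index, dependency label); head 0 is the root.
Arc = Tuple[int, str]
# One annotator's output: sentence string -> one arc per token (hypothetical layout).
Annotation = Dict[str, List[Arc]]


def agreement_same_segmentation(a: Annotation, b: Annotation) -> Tuple[float, float, int]:
    """Return (unlabeled agreement, labeled agreement, tokens compared),
    computed only over sentences that both annotators segmented identically."""
    shared = [s for s in a if s in b and len(a[s]) == len(b[s])]
    total = same_head = same_head_and_label = 0
    for sent in shared:
        for (head_a, label_a), (head_b, label_b) in zip(a[sent], b[sent]):
            total += 1
            if head_a == head_b:
                same_head += 1
                if label_a == label_b:
                    same_head_and_label += 1
    if total == 0:
        return 0.0, 0.0, 0
    return same_head / total, same_head_and_label / total, total


# Toy data with placeholder tokens (not real Vedic text or real annotations).
annotator_1: Annotation = {
    "w1 w2 w3 w4": [(2, "amod"), (4, "obj"), (4, "nsubj"), (0, "root")],
    "w5 w6": [(2, "nsubj"), (0, "root")],
}
annotator_2: Annotation = {
    "w1 w2 w3 w4": [(4, "acl:dpct"), (4, "obj"), (4, "obl"), (0, "root")],
    # This annotator drew the sentence boundary elsewhere, so it is skipped.
    "w5 w6 w7": [(2, "nsubj"), (0, "root"), (2, "obj")],
}

unlabeled, labeled, n_tokens = agreement_same_segmentation(annotator_1, annotator_2)
print(f"tokens compared: {n_tokens}, unlabeled: {unlabeled:.2f}, labeled: {labeled:.2f}")
```

Restricting the comparison to identically segmented sentences keeps segmentation disagreements from being counted a second time as attachment disagreements, which is the motivation the highlight gives for reporting this setting separately.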

Summary

Introduction

Treebanks have become indispensable tools for studying syntactic and morphological phenomena and for enhancing Natural Language Processing (NLP). While earlier endeavors in annotating syntactic structure were largely confined to modern languages (e.g. the Penn treebank), an increasing number of treebanks of ancient languages have been published in recent years.

The indigenous tradition partitions the texts with vertical strokes (|, daṇḍa) only at higher levels of compositional complexity (books, chapters, paragraphs in prose; stanzas and hemistichs in metrical texts), but does not feature a punctuation system that structures utterances, clauses, sentences and their constituents. For these reasons, sentence segmentation must be performed manually as part of the annotation process.

Such clitics can depend on any noun in the clause or on the verb, and alternative interpretations of the text lead to alternative dependencies, all of which are acceptable from the point of view of Vedic grammar. In example (16), which is taken from a Ṛgvedic hymn addressing the god Indra, the adjective priyám ‘dear’ can be interpreted as an attribute of mánma ‘thought’ (label amod) or, alternatively, as a depictive secondary predicate (label acl:dpct; Schultze-Berndt & Himmelmann 2004; Himmelmann & Schultze-Berndt 2005; Casaretto 2020), meaning that Indra, the addressee of the hymn, does not generally rejoice at every thought, but only when the thoughts are dear to him.

