Abstract
Standardized quantitative measurement of texts lies at the heart of digital approaches to the humanities. Structure-based textual measures are known to be influenced by the choice of syntactic annotation scheme. Building on previous research, the present article further explores the relation between annotation schemes and mean dependency distance (MDD) by comparing treebanks of seventeen languages under a tree representation (basic universal dependencies, BUD) and under a graph representation (enhanced universal dependencies, EUD). Following the idea of decomposing an annotation scheme into a combination of analyses of specific constructions (coordinate structures, control constructions, and relative clauses), we design algorithms to identify these constructions in CoNLL-U format treebanks and examine their influence on MDD. We find that, at the corpus level, the overall MDD of the EUD representation is statistically higher than that of BUD, a difference driven primarily by coordinate structures because of their high frequency. At the sentence level, each of the three constructions can either increase or decrease MDD, with stochastically intervening words and word order being two important determinants of the value of the measure. Finally, we argue that MDDs calculated under different annotation schemes should be regarded as different textual measures in nature. In sum, the present study offers a further case study that deepens our understanding of syntactic annotation schemes and their relation to textual indices, paving the way for standardized measurement of texts in future humanities research.
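To make the comparison concrete, the following is a minimal sketch of how MDD might be computed from a CoNLL-U sentence under both representations. It is an illustration, not the article's algorithm. It assumes that MDD is the mean absolute difference between the positions of a dependent and its head, with root relations excluded, and that enhanced edges attached to empty nodes (IDs such as `8.1`) are simply skipped. The function names and the sample sentence are hypothetical.

```python
from statistics import mean

def parse_conllu_sentence(block):
    """Return the 10-column token rows of one CoNLL-U sentence,
    skipping comments, multiword-token ranges (e.g. 1-2), and empty nodes (e.g. 8.1)."""
    rows = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        rows.append(cols)
    return rows

def basic_distances(rows):
    """Dependency distances from the HEAD column (tree representation); root excluded."""
    return [abs(int(c[0]) - int(c[6])) for c in rows
            if c[6].isdigit() and c[6] != "0"]

def enhanced_distances(rows):
    """Dependency distances from the DEPS column (graph representation); root excluded.
    Assumption: edges whose head is an empty node are ignored."""
    dists = []
    for c in rows:
        if c[8] == "_":
            continue
        for dep in c[8].split("|"):
            head = dep.split(":", 1)[0]
            if head.isdigit() and head != "0":
                dists.append(abs(int(c[0]) - int(head)))
    return dists

def mdd(distances):
    """Mean dependency distance; 0.0 for a sentence with no non-root relations."""
    return mean(distances) if distances else 0.0

# Hypothetical control construction: "She wants to leave".
# In the enhanced graph, "She" is also the subject of "leave" (nsubj:xsubj).
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "She", "she", "PRON", "_", "_", "2", "nsubj", "2:nsubj|4:nsubj:xsubj", "_"],
    ["2", "wants", "want", "VERB", "_", "_", "0", "root", "0:root", "_"],
    ["3", "to", "to", "PART", "_", "_", "4", "mark", "4:mark", "_"],
    ["4", "leave", "leave", "VERB", "_", "_", "2", "xcomp", "2:xcomp", "_"],
])

rows = parse_conllu_sentence(SAMPLE)
print(mdd(basic_distances(rows)), mdd(enhanced_distances(rows)))
```

In this toy sentence the extra subject edge added by the enhanced representation spans three positions, so the sentence-level MDD rises. With a different word order or a different number of intervening words, the same construction could contribute a shorter-than-average edge instead.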