Locality and expectation effects in Hindi preverbal constituent ordering

Sidharth Ranjan, Rajakrishnan Rajkumar, Sumeet Agarwal

doi:10.1016/j.cognition.2021.104959

Abstract

We investigate the relative impact of two influential theories of language comprehension, viz., Dependency Locality Theory (Gibson, 2000; DLT) and Surprisal Theory (Hale, 2001; Levy, 2008), on preverbal constituent ordering in Hindi, a predominantly SOV language with flexible word order. Prior work on Hindi has shown that word order scrambling is influenced by information structure constraints in discourse. However, the impact of cognitively grounded factors on Hindi constituent ordering is relatively underexplored. We test the hypothesis that dependency length minimization is a significant predictor of syntactic choice once information status and surprisal measures (estimated from n-gram, i.e., trigram, models and incremental dependency parsing models) have been added to a machine learning model. To this end, we set up a framework to generate meaning-equivalent grammatical variants of Hindi sentences by linearizing preverbal constituents of projective dependency trees in the Hindi-Urdu Treebank (HUTB) corpus of written text. Our results indicate that dependency length displays only a weak effect in predicting reference sentences (amidst variants) over and above the aforementioned predictors. Overall, trigram surprisal outperforms dependency length and parser surprisal by a wide margin, and our analyses indicate that maximizing lexical predictability is the primary driving force behind preverbal constituent ordering choices in Hindi. The success of trigram surprisal notwithstanding, dependency length minimization correctly predicts non-canonical reference sentences with fronted direct objects over variants containing the canonical word order; these are cases where surprisal estimates fail because of their bias towards frequent structures and word sequences. Locality effects persist over the Given-New preference of subject-object ordering in Hindi.
Accessibility and local statistical biases discussed in the sentence processing literature are plausible explanations for the success of trigram surprisal. Further, we conjecture that the presence of case markers is a strong factor that potentially overrides the pressure for dependency length minimization in Hindi. Finally, we discuss the implications of our findings for the information locality hypothesis and theories of language production.
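
As a rough illustration of the dependency length measure discussed above, the following Python sketch enumerates orderings of preverbal constituents and scores each by the summed distance between every constituent's head word and the clause-final verb. It is a toy under stated assumptions, not the paper's pipeline: the function names, the `(label, n_words)` representation, and the assumption that each constituent's head is its last word attached directly to the verb are all hypothetical, and constituent-internal dependencies are ignored because they do not change across orderings.

```python
from itertools import permutations


def verb_attachment_length(constituents):
    """Sum of distances (in words) from each constituent's head to the final verb.

    constituents: list of (label, n_words) tuples in linear order.
    Assumption: the head of each constituent is its last word, and the verb
    immediately follows the last constituent.
    """
    verb_pos = sum(n for _, n in constituents)  # 0-based index of the verb
    total = 0
    end = 0
    for _, n_words in constituents:
        end += n_words
        head_pos = end - 1  # last word of this constituent
        total += verb_pos - head_pos
    return total


def rank_variants(constituents):
    """Return all preverbal orderings sorted by increasing dependency length."""
    variants = [list(order) for order in permutations(constituents)]
    return sorted(variants, key=verb_attachment_length)


if __name__ == "__main__":
    # Hypothetical clause: a one-word subject and a three-word direct object.
    clause = [("subject", 1), ("object", 3)]
    for variant in rank_variants(clause):
        labels = " ".join(label for label, _ in variant)
        print(labels, "verb ->", verb_attachment_length(variant))
```

In this toy clause, placing the longer object before the shorter subject yields the smaller total, which is the kind of fronted-object ordering that a pure dependency length criterion would favor over the canonical subject-object order.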

Full Text