The study aims to establish whether language models trained on unlabeled text data can parametrize agreement variation. We compared acceptability judgments made by native speakers with probability metrics predicted by the language model ruBERT without fine-tuning. As a specific linguistic phenomenon, we considered predicate agreement with a coordinated subject in Russian. We analyzed in detail which syntactic, morphological, and semantic factors influence sentence acceptability and probability. The experimental data enable us to reveal the role of each factor and their interaction. In addition to the standard log probability, we considered sentence length and unigram probability. We assumed that the model would assign the highest probability to the most acceptable agreement strategy. However, this hypothesis was not confirmed: the correlation between probability and acceptability is lower for sentences with agreement variation than for sentences without variation. Linear position, that is, the subject-predicate order and the order of the conjuncts, turned out to be the only factor that influences the acceptability and the probability of a sentence in the same way. If the gender features of the conjuncts match, the acceptability of singular agreement increases, while the probability does not change. The animacy of the conjuncts and the symmetry of the predicate influence neither acceptability nor probability. Our research demonstrates that ruBERT cannot be used to parametrize predicate agreement with a coordinated subject. The acceptability of a sentence rests on subtle linguistic contrasts that are not significant for the automatic evaluation of its probability.
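For context, a minimal sketch of how a sentence-level probability score can be obtained from ruBERT out of the box, assuming a pseudo-log-likelihood setup over the Hugging Face transformers library; the model checkpoint, the masking loop, and the example sentences are illustrative assumptions, not the authors' exact pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint; the paper's exact ruBERT variant may differ.
MODEL_NAME = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | rest of sentence), masking one token at a time."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        # Skip the [CLS] and [SEP] special tokens at the edges.
        for i in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            true_id = masked[i].item()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[true_id].item()
    return total

# Hypothetical contrast between singular and plural agreement
# with the same coordinated subject.
print(pseudo_log_likelihood("Пришёл Петя и Маша."))
print(pseudo_log_likelihood("Пришли Петя и Маша."))
```

Adjustments for sentence length and unigram probability, as mentioned above, would be applied on top of such a raw score.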