Domain Differences in the Distribution of Parts of Speech and Dependency Relations in Hungarian

Veronika Vincze

doi:10.1080/09296174.2013.830553

Abstract

In this article we present statistical data on the distribution of parts of speech and dependency relations in a large, manually annotated Hungarian treebank, the Szeged Dependency Treebank. We hypothesize that the domain of a text influences the distribution of these elements, so we pay special attention to differences between domains. We present the characteristic rank-frequency distributions of parts of speech and dependency relations in Hungarian and analyse the similarities and differences among the sub-corpora with regard to these distributions. Our results reveal that the computer and newspaper texts are most similar to each other, while the literature and compositions domains also exhibit some similarities. The business news and law sub-corpora, by contrast, are each distinct, with characteristics of their own.
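As a rough illustration of the kind of analysis summarized above, the following minimal sketch builds a rank-frequency distribution from a sequence of part-of-speech tags and compares two domains. It is not the method used in the article: the function names, the toy tag sequences, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def rank_frequency(tags):
    """Return (rank, tag, relative frequency) triples, most frequent first.

    Ties in frequency are broken alphabetically so the output is stable.
    """
    counts = Counter(tags)
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, tag, count / total)
        for rank, (tag, count) in enumerate(ordered, start=1)
    ]


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between the tag-count vectors of two domains.

    Assumption: cosine similarity is used here only as a simple stand-in
    for whatever comparison the article itself applies.
    """
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data, NOT taken from the Szeged Dependency Treebank.
toy_domains = {
    "newspaper": ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "DET"],
    "law": ["NOUN", "NOUN", "NOUN", "ADJ", "CONJ", "NOUN"],
}

if __name__ == "__main__":
    for domain, tags in toy_domains.items():
        print(domain, rank_frequency(tags))
    print(cosine_similarity(toy_domains["newspaper"], toy_domains["law"]))
```

The same functions apply unchanged to dependency relation labels, since they only assume a sequence of categorical labels per domain.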
