Abstract. This study quantitatively examines the first five universals of Greenberg’s basic word order typology using 74 large-scale annotated corpora, from two perspectives: word order as a discrete variable and word order as a continuous variable. The results show that (1) the dominant orders extracted from the corpora concur with those retrieved from the World Atlas of Language Structures (henceforth, WALS) and supply dominant orders for languages absent from WALS, demonstrating the feasibility of using corpora to determine dominant orders in typological studies; (2) when word order is treated as a discrete variable, the relative order of adjective and noun cannot be predicted from the relative orders of object and verb and of genitive and noun, which violates Greenberg’s corresponding universal; (3) treating word order as a continuous variable also indicates a violation of this universal; and (4) the language samples drawn from the annotated corpus database further demonstrate that languages in line with this universal are rare and internally heterogeneous. Our findings suggest that typological conclusions can be drawn from frequencies and probabilities extracted from corpus materials, and that the well-known universals should be adopted more cautiously. They also indicate the importance of viewing word order features from multiple perspectives to better capture the characteristics of natural languages.
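As a rough illustration of the two perspectives named in the abstract, the sketch below derives both a continuous measure (the proportion of head-initial tokens) and a discrete label (a dominant order) from the same corpus counts. The abstract does not specify the study's procedure, so everything here is an assumption made for illustration: the function name `order_profile`, the input format (pairs of head and dependent token positions for one relation, such as noun and adjective), and the 2:1 frequency ratio used as the dominance criterion.

```python
from collections import Counter


def order_profile(pairs, ratio=2.0):
    """Summarize the order of one head-dependent relation in a corpus.

    pairs: iterable of (head_position, dependent_position) token indices,
           one pair per attested instance of the relation (hypothetical
           input format, e.g. noun = head, adjective = dependent).
    ratio: assumed dominance criterion; an order counts as dominant when
           it is at least `ratio` times as frequent as the other.

    Returns (proportion_head_initial, dominant_label), or (None, None)
    when the relation is unattested.
    """
    counts = Counter(
        "head-initial" if head < dep else "head-final" for head, dep in pairs
    )
    n_initial = counts["head-initial"]
    n_final = counts["head-final"]
    total = n_initial + n_final
    if total == 0:
        return None, None

    # Continuous view: a proportion between 0 and 1.
    proportion = n_initial / total

    # Discrete view: a categorical label under the assumed ratio criterion.
    if n_initial >= ratio * n_final:
        dominant = "head-initial"
    elif n_final >= ratio * n_initial:
        dominant = "head-final"
    else:
        dominant = "no dominant order"
    return proportion, dominant
```

With noun as head and adjective as dependent, a "head-initial" label corresponds to noun-adjective order. The continuous value keeps information that the discrete label discards, which is one reason the two views of the same corpus can support different conclusions about a universal.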