Abstract Syntactic parsing is one of the core tasks in Natural Language Processing. The development of large-scale multilingual language models has enabled cross-lingual parsing approaches, which allow us to develop parsers for languages that have no treebanks available. However, these approaches rely on the assumption that languages share orthographic representations and lexical entries. In this article, we investigate methods for developing a dependency parser for Xibe, a low-resource language written in a unique script. We first conduct lexicalized monolingual dependency parsing experiments to examine the effectiveness of word, part-of-speech, and character embeddings as well as pre-trained language models. Results show that character embeddings significantly improve performance, while pre-trained language models decrease performance since they do not cover the Xibe script. We also train delexicalized monolingual models, which yield results competitive with the best lexicalized model. Since the monolingual models are trained on a very small training set, we additionally investigate lexicalized and delexicalized cross-lingual models, using six closely related languages as source languages, which cover a wide range of scripts. In this setting, the delexicalized models achieve higher performance than the lexicalized models. A final experiment shows that we can increase the performance of the cross-lingual model by combining source languages and selecting the sentences most similar to Xibe as the training set. However, all cross-lingual parsing results remain considerably lower than those of the monolingual model. We attribute the low performance of the cross-lingual methods to syntactic and annotation differences as well as to the impoverished input of Universal Dependencies part-of-speech tags that the delexicalized model has access to.