Abstract

To overcome the lack of NLP resources for low-resource languages, we can utilize tools that are already available for high-resource languages and then modify their output to conform to the target language. In this study, we proposed an approach to convert an Indonesian constituency treebank to a dependency treebank by utilizing an English NLP tool (Stanford CoreNLP) to create the initial dependency treebank. Some annotations in this initial treebank did not conform to Indonesian grammar, especially the head-directionality of noun phrases. Noun phrases in English are usually head-final, while in Indonesian they are the opposite, head-initial. We proposed a variant of tree rotation algorithms for dependency trees, named headSwap. We used this algorithm to convert the head-directionality of noun phrases that were initially labeled as compounds. Moreover, we also proposed a set of rules to rename the dependency relation labels to conform to the recent guidelines. To evaluate our proposed method, we created a manually annotated gold standard of 2,846 tokens. Experiment results showed that our proposed method improved the Unlabeled Attachment Score (UAS) by 32.5 percentage points, from 61.6% to 94.1%, and the Labeled Attachment Score (LAS) by 41 percentage points, from 44.1% to 85.1%. Finally, we created a new Indonesian dependency treebank of 25,416 tokens, converted automatically using our proposed method. A dependency parser model built using this treebank has a UAS of 75.90% and an LAS of 70.38%.

Highlights

  • Syntactic parsing is “a task of recognizing an input string and assigning a structure to it” (Jurafsky and Martin, 2008)

  • This treebank uses the same format as the Penn Treebank, including the Part-Of-Speech (POS) tagset, the bracketing, and the annotation guidelines, which makes Kethu suitable as input for the Stanford Universal Dependencies (SUD) converter, which was designed for the Penn Treebank

  • We proposed an approach to automatically revise a dependency treebank of a low-resource language that was initially produced by a Natural Language Processing (NLP) tool of a high-resource language


Summary

Introduction

Syntactic parsing is “a task of recognizing an input string and assigning a structure to it” (Jurafsky and Martin, 2008). Dependency parsing has gained popularity because of its applicability to a wide range of NLP tasks, such as machine translation (Čmejrek et al., 2004; Galley and Manning, 2009; Jiang et al., 2016; Gao et al., 2017), information extraction (Niklaus et al., 2018; Gashteovski et al., 2019), and question answering (Meng et al., 2017; Cao et al., 2018). These works have motivated the conversion of available constituency treebanks to dependency treebanks. We proposed a method to revise the output of an English NLP tool, the Stanford Universal Dependencies (SUD) converter (Schuster and Manning, 2016), so that the resulting treebank conforms to Indonesian grammar. We named our proposed tree rotation algorithm for dependency trees the headSwap algorithm. We use this algorithm to implement a rule that converts the head-directionality of noun phrases that were initially labeled as compounds. The contributions of our work are three-fold:

1. We propose a variant of tree rotation algorithms named headSwap that works on dependency trees to swap the head between two nodes.

2. We present a case in which the headSwap algorithm can be applied.
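
To make the idea of swapping the head between two nodes concrete, the following is a minimal Python sketch and not the authors' implementation. The `Token` class, the `head_swap` function, and the example sentence are hypothetical and introduced only for illustration. The sketch also assumes that the other dependents of the old head are re-attached to the new head, which this summary does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    id: int      # 1-based position in the sentence
    form: str    # surface word
    head: int    # id of the head token, 0 for the root
    deprel: str  # dependency relation label


def head_swap(tokens: List[Token], dep_id: int) -> None:
    """Swap the roles of token `dep_id` and its current head, in place."""
    by_id = {t.id: t for t in tokens}
    dep = by_id[dep_id]
    if dep.head == 0:
        raise ValueError("the root has no head to swap with")
    old_head = by_id[dep.head]
    relation = dep.deprel

    # The former dependent takes over the old head's attachment.
    dep.head, dep.deprel = old_head.head, old_head.deprel

    # Assumption: remaining dependents of the old head move to the new head.
    for t in tokens:
        if t.head == old_head.id and t.id != dep.id:
            t.head = dep.id

    # The old head becomes a dependent of the new head, keeping the label.
    old_head.head, old_head.deprel = dep.id, relation


# Illustrative sentence: "Saya membaca buku sejarah" ("I read a history book").
# A head-final analysis attaches "buku" to "sejarah" as a compound.
sentence = [
    Token(1, "Saya", 2, "nsubj"),
    Token(2, "membaca", 0, "root"),
    Token(3, "buku", 4, "compound"),
    Token(4, "sejarah", 2, "obj"),
]
head_swap(sentence, dep_id=3)
for tok in sentence:
    print(tok.id, tok.form, tok.head, tok.deprel)
```

After the call, "buku" is attached to the verb as `obj` and "sejarah" depends on "buku" as `compound`, which is the head-initial analysis expected for an Indonesian noun phrase.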
Related Work
Experiments and Results
Conclusion and Future Work
Funding Information