A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing Task for Low-resource Morphologically Rich Languages

Jivnesh Sandhan, Amrith Krishna, Ashim Gupta, Pawan Goyal, Laxmidhar Behera

doi:10.18653/v1/2021.eacl-srw.16

Abstract

Neural dependency parsing has achieved remarkable performance for many domains and languages. However, its reliance on massive labelled data limits the effectiveness of these approaches for low-resource languages. In this work, we focus on dependency parsing for morphologically rich languages (MRLs) in a low-resource setting. Although morphological information is essential for the dependency parsing task, morphological disambiguation and the lack of powerful analyzers make this information difficult to obtain for MRLs. To address these challenges, we propose simple auxiliary tasks for pretraining. We perform experiments on 10 MRLs in low-resource settings to measure the efficacy of our proposed pretraining method and observe an average absolute gain of 2 points (UAS) and 3.6 points (LAS).
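
For readers unfamiliar with the two metrics, the unlabelled attachment score (UAS) is the percentage of tokens whose predicted head is correct, and the labelled attachment score (LAS) additionally requires the dependency label to match. The following is a minimal sketch of that computation, not the evaluation script used in the paper; the function name `attachment_scores` and the toy sentence are illustrative assumptions, and details such as punctuation handling differ between evaluation setups.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label). Head 0 = root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all scored tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")
    # UAS counts tokens whose head matches, regardless of label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label both match.
    labelled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labelled_hits / n


# Toy example (invented for illustration, not data from the paper).
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "iobj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```

In the toy sentence, the third token has the wrong head and the fourth has the right head but the wrong label, so three of four heads are correct (UAS) while only two of four head-label pairs are (LAS). LAS can never exceed UAS.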

Highlights

Dependency parsing has greatly benefited from neural network-based approaches
The input representation consists of a 300-dimensional FastText embedding (Grave et al., 2018) and a 100-dimensional character embedding produced by a convolutional neural network (CNN) (Zhang et al., 2015)
We focused on dependency parsing for low-resource morphologically rich languages (MRLs), where obtaining morphological information is itself a challenge
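
The input representation named in the highlights can be pictured as a 300-dimensional word vector concatenated with a 100-dimensional character-CNN vector, giving 400 dimensions per token. The sketch below illustrates only the shapes involved and is not the authors' implementation. Everything except the two stated dimensions is an assumption: the names `CharCNN` and `token_representation`, the character embedding size, the kernel width, the max-over-time pooling, the zero-vector fallback for unknown words, and the random weights standing in for trained parameters and pretrained FastText vectors.

```python
import random
from typing import Dict, List

WORD_DIM = 300      # FastText size stated in the text
CHAR_OUT_DIM = 100  # character-CNN output size stated in the text
CHAR_EMB_DIM = 8    # assumption: per-character embedding size
KERNEL = 3          # assumption: convolution window width in characters

rng = random.Random(0)  # random weights stand in for trained parameters


def rand_vec(n: int) -> List[float]:
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


class CharCNN:
    """One convolution over characters followed by max-over-time pooling."""

    def __init__(self) -> None:
        self.char_emb: Dict[str, List[float]] = {}
        # One filter per output dimension, spanning KERNEL character vectors.
        self.filters = [rand_vec(KERNEL * CHAR_EMB_DIM) for _ in range(CHAR_OUT_DIM)]

    def encode(self, word: str) -> List[float]:
        for c in word:
            if c not in self.char_emb:
                self.char_emb[c] = rand_vec(CHAR_EMB_DIM)
        pad = [0.0] * CHAR_EMB_DIM
        rows = [pad] * (KERNEL - 1) + [self.char_emb[c] for c in word] + [pad] * (KERNEL - 1)
        # Flatten each window of KERNEL consecutive character vectors.
        windows = [sum(rows[i:i + KERNEL], []) for i in range(len(rows) - KERNEL + 1)]
        # For each filter, keep its strongest response over all window positions.
        return [max(sum(w * x for w, x in zip(f, win)) for win in windows)
                for f in self.filters]


def token_representation(word: str,
                         word_vectors: Dict[str, List[float]],
                         char_cnn: CharCNN) -> List[float]:
    # Assumption: unknown words fall back to a zero word vector in this sketch.
    word_vec = word_vectors.get(word, [0.0] * WORD_DIM)
    return word_vec + char_cnn.encode(word)  # list concatenation: 300 + 100


# Toy stand-in for a pretrained FastText lookup table.
word_vectors = {"gacchati": rand_vec(WORD_DIM)}
cnn = CharCNN()
rep = token_representation("gacchati", word_vectors, cnn)
print(len(rep))  # 400
```

Because the character CNN reads the surface form directly, inflected forms that share a stem or suffix share character windows, which is one common motivation for pairing it with word-level embeddings in morphologically rich languages.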

Summary

Introduction

Dependency parsing has greatly benefited from neural network-based approaches. While these approaches simplify the parsing architecture and eliminate the need for hand-crafted feature engineering (Chen and Manning, 2014; Dyer et al., 2015; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Kulmizev et al., 2019), their performance has been less impressive for several morphologically rich languages (MRLs) and low-resource languages (More et al., 2019; Seeker and Cetinoglu, 2015). Several approaches have been suggested for improving parsing performance on low-resource languages. These include data augmentation strategies, cross-lingual transfer (Vania et al., 2019), and the use of unlabelled data through semi-supervised learning (Clark et al., 2018) and self-training (Rotman and Reichart, 2019). Incorporating morphological knowledge substantially improves parsing performance for MRLs, including low-resource languages (Vania et al., 2018; Dehouck and Denis, 2018). This aligns well with the linguistic intuition about the role of morphological markers, especially case markers, in deciding the syntactic roles of the words involved (Wunderlich and Lakamper, 2001; Sigurðsson, 2003; Kittila et al., 2011). We primarily focus on one such morphologically rich low-resource language, Sanskrit.

Methods

Results

Conclusion