Hybrid embeddings for transition-based dependency parsing of free word order languages

Fatima Tuz Zuhra, Khalid Saleem

doi:10.1016/j.ipm.2023.103334

Abstract

Neural dependency parsing relies on embeddings such as word embeddings and part-of-speech (POS) embeddings. We propose embeddings that convey more meaning for Arabic-scripted, morphologically rich, free word order languages. In such languages, the POS and morphological features (feats) of one word in a sentence govern the suffixes of another word in the same sentence. With this in view, we extend the well-known maxim “a word is known by the company it keeps” and propose that “a POS is known by the company of suffixes it keeps” and “a morphological feat is known by the company of suffixes it keeps”. We propose two novel embeddings, XPOSngram and FEATSngram. These embeddings are trained on heterogeneous items: pairs of language-specific POS (XPOS) tags and n-grams, referred to as ‘XPOSngram’, and pairs of morphological feats and n-grams, called ‘FEATSngram’. We call this new type of embedding a hybrid embedding. We perform experiments on five treebanks from Universal Dependencies (UD), covering four Arabic-scripted, morphologically rich, free word order, low-resource languages (Urdu, Arabic, Persian and Uyghur). These treebanks contain 42985 sentences in total. The experimental results show that, on average, the proposed approach gains ≈1.24%, ≈0.84% and ≈3.31% in unlabelled attachment score (UAS) over the state-of-the-art language-specific POS embeddings, universal POS embeddings and n-gram embeddings based approaches, respectively. We compare the results of hybrid embeddings for Arabic with the state-of-the-art ArWordVec embeddings. The proposed solution achieves a UAS ≈10.27% higher than that achieved by ArWordVec. We further compare the results of hybrid embeddings for Urdu with two state-of-the-art Urdu word embeddings. The results show that the best hybrid embedding has a UAS ≈3.32% and ≈5.015% higher than the two embeddings, respectively.
We also test the proposed methodology on five UD treebanks of non-Arabic-scripted languages: Belarusian, Dutch, German, Greek, and Hungarian. The experimental results demonstrate that the proposed approach not only outperforms the baselines on Arabic-scripted languages but also generalizes well to non-Arabic-scripted, free word order languages, with an average UAS gain of ≈2.5%, ≈2.8% and ≈7.5% over the state-of-the-art XPOS, UPOS and n-gram based approaches, respectively.
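
To make the idea of heterogeneous training items concrete, the following is a minimal sketch of how (XPOS, suffix n-gram) pairs might be collected from a tagged sentence. The abstract does not specify the pairing procedure, so everything here is an assumption for illustration: the function names `suffix_ngrams` and `xpos_ngram_pairs`, the context `window`, the maximum n-gram length `max_n`, and the transliterated tokens and tag labels in the usage example are all hypothetical.

```python
from typing import List, Tuple

# A token is (word form, language-specific POS tag). Both values below are
# illustrative placeholders, not taken from any treebank.
Token = Tuple[str, str]


def suffix_ngrams(form: str, max_n: int = 3) -> List[str]:
    """Return the character suffixes of `form` with lengths 1..max_n."""
    return [form[-n:] for n in range(1, min(max_n, len(form)) + 1)]


def xpos_ngram_pairs(
    sentence: List[Token], window: int = 2, max_n: int = 3
) -> List[Tuple[str, str]]:
    """Pair each word's XPOS tag with the suffix n-grams of its neighbours.

    Assumption: the "company of suffixes" a POS keeps is approximated by the
    suffixes of the other words within `window` positions in the same sentence.
    """
    pairs: List[Tuple[str, str]] = []
    for i, (_, xpos) in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a tag is paired with suffixes of *other* words
            for gram in suffix_ngrams(sentence[j][0], max_n):
                pairs.append((xpos, gram))
    return pairs


if __name__ == "__main__":
    # Hypothetical two-token sentence with made-up tags.
    toy = [("larka", "NN"), ("gaya", "VM")]
    for pair in xpos_ngram_pairs(toy, window=1, max_n=2):
        print(pair)
```

Under these assumptions, the resulting pairs could be fed to any skip-gram style trainer in place of the usual (word, context word) pairs. A FEATSngram variant would substitute the morphological feature string for the XPOS tag.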

Full Text