Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

Xiao Li, Ya-Ting Yang, Yang Yuan

doi:10.3390/info11010024

Abstract

To overcome data sparseness when training word embeddings for low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves the performance of word embeddings for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.

Highlights

Textual data are mainly represented in the vector form: one-hot representation [1] and distributed representation [2]
In order to verify the contributions of the SOP-based distance attenuation function, word-pair
The results clearly show that the SOP-based distance attenuation function and word alignment probability can effectively improve the performance of word embedding on the small-scale corpus

Summary

Introduction

Textual data are mainly represented in vector form, using either one-hot representation [1] or distributed representation [2]. One-hot representation is a very simple scheme: the total number of distinct words appearing in the text is counted as N, and each word is represented by an N-dimensional vector with a single nonzero entry. Distributed representation maps the attribute features of words into dense, continuous real-valued vectors, commonly referred to as word embeddings. Word embeddings are easier for computers to process and, because they contain some grammatical and semantic information about the vocabulary, are usually used together with the distributional hypothesis (words with the same context will have the same or similar semantic relationship) for semantic relation mining. Word embedding is widely used in research fields such as data mining, machine translation, automatic question answering, and information extraction.
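The contrast between the two representations can be illustrated with a minimal Python sketch. It is not taken from the paper: the toy sentence, the helper names (`build_vocab`, `one_hot`, `cosine`), the embedding dimension of 3, and the randomly initialized (untrained) embedding table are all assumptions made for illustration.

```python
import math
import random


def build_vocab(tokens):
    """Map each distinct word to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return an N-dimensional vector with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical); N is the number of distinct words.
tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
N = len(vocab)

cat_oh = one_hot("cat", vocab)
mat_oh = one_hot("mat", vocab)

# One-hot vectors of different words are always orthogonal, so they
# carry no information about how similar two words are.
sim_one_hot = cosine(cat_oh, mat_oh)

# Distributed representation: each word is a row of a dense N x d table.
# The values here are random placeholders, not trained embeddings.
random.seed(0)
d = 3
embedding = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
cat_dense = embedding[vocab["cat"]]
mat_dense = embedding[vocab["mat"]]

# Dense vectors can have any similarity in [-1, 1]; training is what
# would make this value reflect shared contexts.
sim_dense = cosine(cat_dense, mat_dense)
```

In this sketch the one-hot vectors grow with the vocabulary size N, while the dense vectors keep a fixed dimension d, which is the practical reason distributed representations are preferred for semantic relation mining.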

Methods

Results

Conclusion