Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way

Sanjanasri JP, Vijay Krishna Menon, Agnieszka Wolk, Soman KP, Rajendran S

doi:10.3390/electronics10121372

Open Access

https://doi.org/10.3390/electronics10121372

Journal: Electronics
Publication Date: Jun 8, 2021
Citations: 3
License type: CC BY 4.0

Affiliation: Amrita Vishwa Vidyapeetham University

Abstract

Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.

Highlights

The mapping of a word to a representation of its meaning is termed semantic representation.
Contextual word Vectors (CoVe), the pre-trained encoder of a Machine Translation–Long Short-Term Memory (MT-LSTM) model, is applied across various downstream Natural Language Processing (NLP) tasks, such as sentiment analysis and question classification, following the transfer learning idea.
A Multi-Layer Perceptron (MLP), being fully connected, is unable to ignore noisy aspects of the data, whereas a Convolutional Neural Network (CNN) is well suited to disregarding noise and retaining the aspects that are most prominent in the data.

Summary

Overview

Cross-lingual embedding is accomplished by mapping the vectors from one language's embedding space into that of the other language through a transfer function. Multiple experiments with various methodologies are carried out to obtain target word vectors for the English–Tamil language pair. The trained cross-lingual model, Transfer Function-based Generated. Pre-trained Hindi and Chinese embeddings (Word2Vec) were piped through the cross-lingual model on the target side to show the sharing property (transferability). The generated embeddings were further validated on real NLP tasks for low-resource languages, featuring Tamil: text summarisation, multi-class part-of-speech (POS) tagging, and Bilingual Dictionary Induction (BDI).
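The idea of a transfer function learned from a bilingual dictionary can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the simplest case, a linear map fitted by gradient descent on mean squared error, and it uses toy 3-dimensional source and 2-dimensional target vectors. All names (`fit_linear_map`, `nearest`, the `en_*` and `ta_*` tokens) and all numbers are hypothetical placeholders. The last step maps a source word that is absent from the dictionary and retrieves its nearest target word by cosine similarity, which is the basic mechanism behind bilingual dictionary induction.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_linear_map(pairs, dim_src, dim_tgt, lr=0.5, steps=200):
    """Fit W minimising the mean of ||W x - y||^2 over (x, y) pairs."""
    W = [[0.0] * dim_src for _ in range(dim_tgt)]
    n = len(pairs)
    for _ in range(steps):
        grad = [[0.0] * dim_src for _ in range(dim_tgt)]
        for x, y in pairs:
            pred = matvec(W, x)
            for i in range(dim_tgt):
                err = pred[i] - y[i]
                for j in range(dim_src):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(dim_tgt):
            for j in range(dim_src):
                W[i][j] -= lr * grad[i][j]
    return W

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vec, vocab):
    """Return the vocabulary word whose vector is most cosine-similar to vec."""
    return max(vocab, key=lambda word: cosine(vec, vocab[word]))

# Hypothetical toy embeddings (placeholders, not real Word2Vec vectors).
source_vecs = {
    "en_a": [1.0, 0.0, 0.0], "en_b": [0.0, 1.0, 0.0],
    "en_c": [0.0, 0.0, 1.0], "en_d": [1.0, 1.0, 0.0],
    "en_e": [1.0, 0.0, 1.0],  # not in the dictionary
}
target_vecs = {
    "ta_a": [0.5, 1.5], "ta_b": [-1.0, 0.25], "ta_c": [2.0, -0.5],
    "ta_d": [-0.5, 1.75], "ta_e": [2.5, 1.0],
}
# Hypothetical bilingual dictionary used as supervision.
dictionary = [("en_a", "ta_a"), ("en_b", "ta_b"),
              ("en_c", "ta_c"), ("en_d", "ta_d")]

pairs = [(source_vecs[s], target_vecs[t]) for s, t in dictionary]
W = fit_linear_map(pairs, dim_src=3, dim_tgt=2)

# Project an unseen source word into the target space and look up its neighbour.
mapped = matvec(W, source_vecs["en_e"])
induced = nearest(mapped, target_vecs)
print(induced, [round(v, 4) for v in mapped])
```

In this toy setup the target vectors are an exact linear image of the source vectors, so a linear map recovers them. Real embedding spaces are only approximately related, which is why nonlinear transfer functions such as an MLP or a 1D CNN are also considered.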

Motivation

Bilingual Embeddings and TFGE

Case of a Low-Resource Target Language

Premise

State-of-the-Art Transfer Learning Techniques in NLP

Dataset Description

Learning Transfer Functions

Linear Mapping

Multi-Layer Perceptron

One Dimensional—Convolutional Neural Network

Comparison of Various Monolingual Word Embedding Models

Evaluation Tasks

Quantitative Evaluation

Pairwise Accuracy of Similar Words

Neighborhood Accuracy

Qualitative Evaluation

Evaluation Based on Usability Tests

Text Summarization

Bilingual Dictionary Induction

Results and Discussion

Quantitative Evaluation Results

Usability Evaluation Results

Discussions

Conclusions

Future Work
