Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Information Retrieval in Indian Languages

Paheli Bhattacharya, Sudeshna Sarkar, Pawan Goyal

doi:10.1145/3208358

Abstract

We investigate the use of word embeddings for query translation to improve precision in cross-language information retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words are close to each other in this space. Multilingual word embeddings are constructed so that similar words across languages have similar vector representations. We explore the effective use of bilingual and multilingual word embeddings, learned from comparable corpora of Indic languages, for CLIR. We propose a clustering method based on the multilingual word vectors to group similar words across languages. To do so, we construct a graph with words from multiple languages as nodes and with edges connecting words that have similar vectors. We use the Louvain method for community detection to find communities in this graph. We show that choosing target-language words as query translations from the clusters, or communities, that contain the query terms improves CLIR. We also find that better-quality query translations are obtained when words from more languages are used for clustering, even when the additional languages are neither the source nor the target language. This is probably because having more similar words across multiple languages helps form well-defined, dense subclusters that yield precise query translations. In this article, we demonstrate the use of multilingual word embeddings and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati, and Tamil.
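The pipeline described in the abstract (similarity graph over multilingual word vectors, community detection, then picking target-language words from the query term's community) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy 2-D vectors, the romanized words, the 0.8 similarity threshold, and the function names are all assumptions made for illustration. To stay within the standard library, connected components are used here as a plain stand-in for the Louvain method; a real setup would run an actual Louvain implementation on the weighted graph. Ranking candidates by cosine similarity to the query is likewise an assumption, since the abstract does not say how translations are chosen within a community.

```python
import math

# Toy 2-D "multilingual embeddings", keyed by (language, word).
# All values are invented; real vectors would be learned from comparable corpora.
VECTORS = {
    ("en", "water"): (0.90, 0.10),
    ("hi", "paani"): (0.88, 0.15),
    ("bn", "jol"): (0.85, 0.12),
    ("en", "book"): (0.10, 0.90),
    ("hi", "kitaab"): (0.12, 0.92),
    ("bn", "boi"): (0.15, 0.88),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_graph(vectors, threshold):
    """Connect two words when their cosine similarity reaches the threshold."""
    graph = {node: set() for node in vectors}
    nodes = list(vectors)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def find_communities(graph):
    """Stand-in for Louvain: return connected components of the graph."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def translate(query, target_lang, vectors, communities):
    """Return target-language words from the query's community, most similar first."""
    for community in communities:
        if query in community:
            candidates = [n for n in community if n[0] == target_lang]
            return sorted(
                candidates,
                key=lambda n: cosine(vectors[query], vectors[n]),
                reverse=True,
            )
    return []


graph = build_graph(VECTORS, threshold=0.8)  # threshold is an assumed value
communities = find_communities(graph)
translations = translate(("hi", "paani"), "en", VECTORS, communities)
print(translations)
```

Note that the Bengali words take part in the clustering even though the query is Hindi and the target is English, which mirrors the abstract's observation that additional languages can help shape the communities.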


Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Dec 17, 2018
Citations: 8

Similar Papers

Unsupervised Multilingual Word Embeddings
Xilun Chen ... Claire Cardie
01 Jan 2018

Explorations into the Use of Word Embedding in Math Search and Math Semantics
Abdou Youssef ... Bruce R Miller
01 Jan 2019

Using Word Embeddings for Query Translation for Hindi to English Cross Language Information Retrieval
Paheli Bhattacharya ... Pawan Goyal
Computación Y Sistemas | VOL. 20
30 Sep 2016

To translate or not to translate?
Chia-Jung Lee ... Chin-Hui Chen
19 Jul 2010
