Abstract

Distributed vector representations of words are now well established in Natural Language Processing. To tailor general-purpose vector spaces to the context under analysis, several domain adaptation techniques have been proposed; they all require sufficiently large document corpora tailored to the target domains. However, in many cross-lingual NLP settings, both sufficiently large domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. This paper tackles this issue by proposing a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source language (typically, English). The proposed inference method relies on a two-step process, which first automatically identifies domain-specific words and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language to learn how to tailor the vector space of the target language to the domain of interest. The performance of the proposed method was validated via extrinsic evaluation on the established word retrieval task. To this end, a new benchmark multilingual dataset derived from Wikipedia has been released. The results confirm the effectiveness and usability of the proposed approach.
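
A minimal sketch of this two-step idea is given below. It assumes the gensim library, hypothetical file names, a simple frequency-ratio heuristic for identifying domain-specific words, and an MLP regressor as a stand-in for the non-linear transformation; none of these choices is taken from the paper itself.

    # Sketch of the two-step inference process (illustrative only, not the authors' code).
    # Assumed inputs, saved as gensim KeyedVectors with hypothetical file names:
    #   src_general.kv - general-purpose embeddings of the source language (e.g., English)
    #   src_domain.kv  - domain-specific embeddings of the source language
    #   tgt_general.kv - general-purpose embeddings of the target language, already
    #                    cross-lingually aligned with src_general.kv
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.neural_network import MLPRegressor

    src_general = KeyedVectors.load("src_general.kv")
    src_domain = KeyedVectors.load("src_domain.kv")
    tgt_general = KeyedVectors.load("tgt_general.kv")

    # Step 1: identify domain-specific words in the source language. Here a simple
    # frequency-ratio heuristic is assumed: a word counts as domain-specific if it
    # occurs far more often in the domain corpus than in the general-purpose one
    # (corpus-size normalization omitted for brevity).
    def domain_specific_words(general_kv, domain_kv, ratio=5.0):
        words = []
        for w in domain_kv.key_to_index:
            if w not in general_kv:
                continue
            f_dom = domain_kv.get_vecattr(w, "count")   # requires stored corpus counts
            f_gen = general_kv.get_vecattr(w, "count")
            if f_dom / max(f_gen, 1) >= ratio:
                words.append(w)
        return words

    seeds = domain_specific_words(src_general, src_domain)

    # Step 2: learn a non-linear transformation mapping the source general-purpose
    # vectors of the seed words onto their domain-specific counterparts, then reuse
    # it on the target-language vectors to obtain an approximate domain-specific
    # space aligned with the source domain space.
    X = np.vstack([src_general[w] for w in seeds])
    Y = np.vstack([src_domain[w] for w in seeds])
    mapper = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500).fit(X, Y)

    tgt_domain_vectors = mapper.predict(tgt_general.vectors)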

Highlights

  • In recent years, distributed vector representations of text have been widely applied to solve complex tasks in Natural Language Processing (NLP) such as sentiment analysis [1], machine translation [2], text categorization [3], and synonym prediction [4]. A pioneering word embedding model, namely Word2Vec, was proposed in [5].

  • The present study focuses on the Word2Vec model because, as discussed later on, it allows both word-level domain adaptation and multilingual alignment and remains highly popular in several NLP applications [11].

  • We differentiate between general-purpose embeddings E^l_G, i.e., word embeddings trained on multi-domain, general-interest document corpora such as the whole Wikipedia corpus, and domain-specific embeddings E^l_δ, which are specialized using document corpora tailored to a specific domain δ (see the sketch below).
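
A quick way to see the difference between E^l_G and E^l_δ is to compare the neighbourhood of the same word in the two spaces; the sketch below assumes two pre-trained gensim KeyedVectors files with hypothetical names and an illustrative probe word.

    # Illustrative comparison of a general-purpose space E^l_G and a domain-specific
    # space E^l_δ for the same language (file names and the probe word are assumptions).
    from gensim.models import KeyedVectors

    general = KeyedVectors.load("en_general.kv")    # e.g., trained on the whole Wikipedia
    domain = KeyedVectors.load("en_biology.kv")     # e.g., specialized on a biology corpus

    # The same word is expected to have different neighbourhoods in the two spaces.
    print(general.most_similar("cell", topn=5))
    print(domain.most_similar("cell", topn=5))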



Introduction

In recent years, distributed vector representations of text have been widely applied to solve complex tasks in Natural Language Processing (NLP) such as sentiment analysis [1], machine translation [2], text categorization [3], and synonym prediction [4]. A pioneering word embedding model, namely Word2Vec, was proposed in [5]. The present study focuses on the Word2Vec model because, as discussed later on, it allows both word-level domain adaptation and multilingual alignment and remains highly popular in several NLP applications [11]. The goal is to tailor the designed NLP solutions to specific application domains, such as energy [12], biology [15], and industry [14]. Within this scope, unsupervised domain adaptation techniques are appealing, as they allow end-users to fine-tune a general-purpose model even in the absence of labeled data [13], [16].
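
As a concrete illustration of such unsupervised, word-level domain adaptation, the sketch below continues training a general-purpose gensim Word2Vec model on an unlabeled domain corpus; the file names and hyper-parameters are placeholders, not the settings used in the cited works.

    # Sketch of unsupervised, word-level domain adaptation with gensim's Word2Vec:
    # a general-purpose model is fine-tuned on an unlabeled domain-specific corpus.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec.load("general_purpose_w2v.model")   # pre-trained general model
    domain_corpus = LineSentence("domain_corpus.txt")    # one tokenized sentence per line

    model.build_vocab(domain_corpus, update=True)        # add unseen domain-specific words
    model.train(domain_corpus, total_examples=model.corpus_count, epochs=5)
    model.wv.save("domain_adapted_w2v.kv")

Updating the vocabulary before resuming training lets domain-specific terms that never appeared in the general corpus receive vectors of their own, without requiring any labeled data.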
