Abstract

Based on the assumption of isomorphism, approaches for generating high-quality, low-cost multilingual word embeddings (MLWEs) have been critical mechanisms for facilitating knowledge transfer between languages, notably from resource-rich to resource-lean languages. However, recent studies have found that the isomorphism assumption does not hold widely and that approaches built on it face significant limitations, leading to stagnation in multilingual natural language processing (MNLP). Instead of pursuing ever higher-quality MLWEs, we propose MUSEDA (Multilingual Unsupervised and Supervised Embedding for Domain Adaptation), a framework for building multilingual word embeddings for domain transfer learning that combines pivot-feature (domain keyword) mining with weighted embedding alignment in both supervised and unsupervised settings. Without using any additional knowledge or parallel data, pivot words are mined from the available data and introduced into the embedding alignment as weighting factors. We apply our framework to real-world tasks from different domains, namely sentiment classification, news categorization, and named entity recognition, on resource-lean languages including Arabic and Vietnamese, and the experimental results demonstrate that the proposed method surpasses the baseline approaches. In addition, to alleviate the scarcity of in-domain corpora for a single source language, we propose a many-to-many (M2M) mode that further improves performance by integrating multiple source languages into a large common space.
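The weighted embedding alignment described above can be pictured as a weighted orthogonal Procrustes problem. The sketch below is a minimal illustration, assuming per-pair weights derived from mined pivot words are folded into the standard SVD-based solution; the function name, weighting scheme, and toy data are hypothetical and not taken from the paper.

import numpy as np

def weighted_procrustes(X, Y, w):
    # X: (n, d) source-language vectors for the dictionary pairs
    # Y: (n, d) target-language vectors for the same pairs
    # w: (n,)   per-pair weights, e.g. larger for mined pivot words
    sw = np.sqrt(w)[:, None]           # fold weights into both sides
    M = (sw * X).T @ (sw * Y)          # weighted cross-covariance X^T diag(w) Y
    U, _, Vt = np.linalg.svd(M)        # SVD yields the closest rotation
    return U @ Vt                      # orthogonal W with X @ W ~ Y

# Toy usage: uniform weights, with the first 10 "pivot" pairs up-weighted.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
Y = rng.normal(size=(100, 50))
w = np.ones(100)
w[:10] = 5.0
W = weighted_procrustes(X, Y, w)

Because the mapping W is constrained to be orthogonal, the weighted problem still reduces to a single SVD, so pivot weighting adds essentially no optimization cost over plain Procrustes alignment.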
