Abstract

Existing studies in cross-language information retrieval (CLIR) mostly rely on general-purpose text representation models (e.g., the vector space model or latent semantic analysis), which are not optimized for the target retrieval task. In this paper, following the success of neural representations in natural language processing (NLP), we develop a novel text representation model based on adversarial learning, which seeks a task-specific embedding space for CLIR. Adversarial learning is implemented as an interplay between a generator process and a discriminator process. To adapt adversarial learning to CLIR, we design three constraints that direct representation learning: (1) a matching constraint capturing the essential characteristics of cross-language ranking, (2) a translation constraint bridging language gaps, and (3) an adversarial constraint forcing both language invariance and media invariance to be achieved more efficiently and effectively. Through the joint exploitation of these constraints in an adversarial manner, the cross-language semantics relevant to the retrieval task are better preserved in the embedding space. Standard CLIR experiments show that our model significantly outperforms state-of-the-art continuous-space models and beats a strong machine translation baseline.
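The abstract only names the three constraints; purely as illustration, the following is a minimal PyTorch-style sketch of how such a joint objective could be wired together. All names (CLIREmbedder, lang_disc, lambd) and the concrete losses (a margin ranking loss for matching, a cosine loss for translation, a gradient-reversal adversarial loss) are assumptions for exposition, not the paper's actual formulation.

    # Hypothetical sketch: joint loss with matching, translation, and adversarial terms.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; negates gradients in the backward pass.
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lambd * grad_out, None

    class CLIREmbedder(nn.Module):  # hypothetical name
        def __init__(self, src_vocab, tgt_vocab, dim=128):
            super().__init__()
            self.src_enc = nn.EmbeddingBag(src_vocab, dim)  # query-language encoder
            self.tgt_enc = nn.EmbeddingBag(tgt_vocab, dim)  # document-language encoder
            self.lang_disc = nn.Sequential(                 # discriminator: which language?
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

        def forward(self, query, doc_pos, doc_neg, lambd=1.0, margin=0.5):
            zq = F.normalize(self.src_enc(query), dim=-1)
            zp = F.normalize(self.tgt_enc(doc_pos), dim=-1)
            zn = F.normalize(self.tgt_enc(doc_neg), dim=-1)
            # (1) Matching constraint: relevant documents should outrank irrelevant ones.
            l_match = F.relu(margin - (zq * zp).sum(-1) + (zq * zn).sum(-1)).mean()
            # (2) Translation constraint: aligned cross-language pairs stay close
            #     (here the relevant pair stands in for a parallel translation pair).
            l_trans = (1.0 - (zq * zp).sum(-1)).mean()
            # (3) Adversarial constraint: reversed gradients push the encoders to
            #     fool the language discriminator, yielding language-invariant features.
            feats = torch.cat([GradReverse.apply(zq, lambd),
                               GradReverse.apply(zp, lambd)])
            labels = torch.cat([torch.zeros(len(zq)), torch.ones(len(zp))]).long()
            l_adv = F.cross_entropy(self.lang_disc(feats), labels)
            return l_match + l_trans + l_adv

With a gradient reversal layer, a single backward pass trains the discriminator to identify the input language while pushing both encoders toward a language-invariant space, mirroring the generator/discriminator interplay described above.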

Highlights

  • Text representation is a crucial problem in most natural language processing (NLP) and information retrieval (IR) tasks

  • We argue that task-specific embeddings are superior, a claim motivated by monolingual IR studies and validated by the cross-language information retrieval (CLIR) experiments in this paper

  • We propose a novel text representation approach for CLIR based on the adversarial learning framework

Introduction

Text representation is a crucial problem in most natural language processing (NLP) and information retrieval (IR) tasks. In monolingual IR, early work mostly used vector space models for query-document semantic matching (Salton et al., 1975), which suffer from the problems of synonymy and polysemy. To bridge these lexical gaps, latent semantic models such as latent semantic analysis (LSA) (Deerwester et al., 1990) were proposed to abstract away from surface text forms toward approximate semantics. Beyond document ranking, CLIR models must also cross language barriers, which makes the task intuitively harder than monolingual IR. Traditional approaches reduce CLIR to its monolingual counterpart by translating queries and/or documents; dictionary-based approaches, however, suffer from the same lexical gaps as the monolingual case (Gupta et al., 2017). CLIR therefore needs an effective cross-language representation that can bridge both the language gap and the lexical gap.
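For readers unfamiliar with LSA, the following toy illustration (not from the paper) shows the usual truncated-SVD construction using scikit-learn; the corpus and dimensionality are placeholders.

    # Illustration of LSA: project TF-IDF vectors into a low-rank semantic space.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "car engine repair",
        "automobile engine maintenance",
        "stock market trading",
        "market trading report",
    ]
    tfidf = TfidfVectorizer().fit_transform(docs)  # sparse term-document matrix
    lsa = TruncatedSVD(n_components=2).fit(tfidf)  # truncated SVD = LSA
    doc_vecs = lsa.transform(tfidf)                # dense document embeddings
    # Terms that co-occur ("car"/"automobile" via "engine") are merged into shared
    # latent dimensions, mitigating the synonymy that defeats exact term matching.

In the same spirit, the cross-language embedding proposed in this paper replaces such a generic low-rank objective with constraints specific to the retrieval task.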
