Neural models for information retrieval without labeled data

Hamed Zamani

doi:10.1145/3458553.3458569

Open Access

Journal: ACM SIGIR Forum	Publication Date: Dec 1, 2019
Citations: 1	License type: cc-by

Affiliation: University of Massachusetts Amherst

Abstract

Recent developments in machine learning models, and in particular deep neural networks, have yielded significant improvements on several computer vision, natural language processing, and speech recognition tasks. Progress with information retrieval (IR) tasks has been slower, however, due to the lack of large-scale training data as well as neural network models specifically designed for effective information retrieval [9]. In this dissertation, we address these two issues by introducing task-specific neural network architectures for a set of IR tasks and proposing novel unsupervised or weakly supervised solutions for training the models. The proposed learning solutions do not require labeled training data. Instead, in our weak supervision approach, neural models are trained on a large set of noisy and biased training data obtained from external resources, existing models, or heuristics. We first introduce relevance-based embedding models [3] that learn distributed representations for words and queries. We show that the learned representations can be effectively employed for a set of IR tasks, including query expansion, pseudo-relevance feedback, and query classification [1, 2]. We further propose a standalone learning-to-rank model based on deep neural networks [5, 8]. Our model learns a sparse representation for queries and documents. This enables us to perform efficient retrieval by constructing an inverted index in the learned semantic space. Our model outperforms state-of-the-art retrieval models, while performing as efficiently as term-matching retrieval models. We additionally propose a neural network framework for predicting the performance of a retrieval model for a given query [7]. Inspired by existing query performance prediction models, our framework integrates several information sources, such as retrieval score distribution and term distribution in the top retrieved documents. This leads to state-of-the-art results for the performance prediction task on various standard collections. We finally bridge the gap between retrieval and recommendation models, as the two key components in most information systems. Search and recommendation often share the same goal: helping people get the information they need at the right time. Therefore, joint modeling and optimization of search engines and recommender systems could potentially benefit both systems [4]. In more detail, we introduce a retrieval model that is trained using user-item interaction (e.g., recommendation data), with no need for query-document relevance information for training [6]. Our solutions and findings in this dissertation smooth the path towards learning efficient and effective models for various information retrieval and related tasks, especially when large-scale training data is not available.
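
The idea of retrieving from an inverted index built over sparse representations can be illustrated with a minimal sketch. The sparse vectors below are hand-written toy values rather than the output of any trained model, and the names `build_inverted_index` and `retrieve` are hypothetical. The sketch only shows why sparsity helps: scoring visits just the documents that share a non-zero dimension with the query.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each non-zero dimension to a posting list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for dim, weight in vec.items():
            if weight != 0.0:
                index[dim].append((doc_id, weight))
    return index


def retrieve(index, query_vector, k=10):
    """Score documents by dot product, visiting only postings of active query dimensions."""
    scores = defaultdict(float)
    for dim, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy sparse representations {dimension: weight}; assumed values for illustration.
docs = {
    "d1": {0: 0.5, 3: 1.0},
    "d2": {3: 0.5, 7: 2.0},
    "d3": {5: 1.0},
}
index = build_inverted_index(docs)
results = retrieve(index, {3: 2.0, 7: 1.0})
```

Here `d3` is never scored because it shares no active dimension with the query, which is the same property that makes term-matching retrieval efficient.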

Highlights

Information Retrieval (IR) is a field of science concerned with finding material of mostly unstructured nature to satisfy an information need [102]. Information retrieval technologies have impacted, and are having an increasing impact on, people’s everyday lives.
We use two different query embedding approaches: one using the average word embedding of query terms (AWE), and one based on the pseudo-query vector (PQV), which uses the top 10 retrieved documents to estimate the query embedding.
We further study the influence of incorporating multiple weak supervision signals.
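
The two query embedding approaches named above can be sketched as follows. AWE is simply the mean of the query terms' word vectors. The `pseudo_query_vector` function is a simplified stand-in, not the dissertation's PQV estimator: it assumes the query embedding is the plain average over all terms in the top-k retrieved documents. The function names and the toy embeddings are hypothetical.

```python
def average_word_embedding(terms, embeddings):
    """AWE: component-wise mean of the vectors of the terms found in `embeddings`."""
    vectors = [embeddings[t] for t in terms if t in embeddings]
    if not vectors:
        raise ValueError("no term has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def pseudo_query_vector(ranked_docs, embeddings, k=10):
    """Simplified stand-in for PQV (an assumption of this sketch):
    average the word vectors of all terms in the top-k retrieved documents."""
    feedback_terms = [term for doc in ranked_docs[:k] for term in doc]
    return average_word_embedding(feedback_terms, embeddings)


# Toy 2-dimensional embeddings; assumed values, not learned.
embeddings = {"neural": [1.0, 0.0], "ranking": [0.0, 1.0], "model": [1.0, 1.0]}

awe = average_word_embedding(["neural", "ranking"], embeddings)
# Each "document" is a list of terms, ordered by retrieval rank.
pqv = pseudo_query_vector([["neural", "model"], ["ranking", "model"]], embeddings)
```

A real feedback-based estimator would typically weight documents by their retrieval scores; the unweighted average is used here only to keep the sketch short.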

Summary

Introduction

Information Retrieval (IR) is a field of science concerned with finding material of mostly unstructured nature to satisfy an information need [102]. Information retrieval technologies have impacted, and are having an increasing impact on, people’s everyday lives.

Efficient algorithms for learning word representations have been proposed that effectively model semantic relationships between words. GloVe employs a matrix factorization algorithm to decompose a global word-word co-occurrence matrix into two lower-rank matrices. Both of these models have shown promising results in a set of natural language processing tasks.

Quality estimation is a fundamental task that can help improve effectiveness or efficiency in various applications, such as machine translation [156] and automatic speech recognition [21, 115]. For search engines, the task is called query performance prediction or query difficulty prediction. The task of query performance prediction (QPP) is defined as predicting the retrieval effectiveness of a search engine given an issued query with no implicit or explicit relevance information. Hauff et al. [64] provided a thorough overview of the pre-retrieval QPP approaches.
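
As a concrete illustration of a pre-retrieval predictor, the sketch below computes the average inverse document frequency (IDF) of the query terms, a classic baseline in the QPP literature. It is not the neural framework proposed in the dissertation. The formulation idf(t) = log(N / df(t)), the function name `average_idf`, and the toy collection statistics are assumptions of this example.

```python
import math


def average_idf(query_terms, doc_freq, num_docs):
    """Mean IDF of the query terms, using idf(t) = log(num_docs / df(t)).

    Terms that do not occur in the collection are skipped, which is an
    assumption of this sketch. Returns 0.0 if no term can be scored.
    """
    idfs = [
        math.log(num_docs / doc_freq[t])
        for t in query_terms
        if doc_freq.get(t, 0) > 0
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


# Toy document frequencies for a 1000-document collection; assumed values.
doc_freq = {"the": 1000, "retrieval": 10, "neural": 100}

specific = average_idf(["neural", "retrieval"], doc_freq, num_docs=1000)
generic = average_idf(["the"], doc_freq, num_docs=1000)
```

The predictor needs only collection statistics and no retrieval run, which is what makes it pre-retrieval. A higher value suggests more discriminative query terms, so `specific` scores above `generic` here.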