Information Retrieval Problem Research Articles

In today's world, there is no shortage of information. However, for a specific information need, only a small subset of the available information will be useful. The field of information retrieval (IR) is the study of methods that provide users with the small subset of information relevant to their needs, and do so in a timely fashion. Information sources can take many forms, but this thesis will focus on text-based information systems and investigate problems germane to the retrieval of written natural language documents. Central to these problems is the notion of topic. In other words, what are documents about? However, topics depend on the semantics of documents, and retrieval systems are not endowed with knowledge of the semantics of natural language. The approach taken in this thesis will be to use probabilistic language models to investigate text-based information retrieval and related problems.

One such problem is the prediction of topic shifts in text, the topic segmentation problem. It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection. Two complementary sets of features are studied individually and then combined into a single language model. The language modeling approach allows this problem to be addressed in a principled way without complex semantic modeling.

Next, the problem of document retrieval in response to a user query will be investigated. Models of document indexing and document retrieval have been extensively studied over the past three decades. The integration of these two classes of models has been the goal of several researchers, but it is a very difficult problem. Much of the difficulty arises because the indexing component requires inferences as to the semantics of documents. Instead, an approach to retrieval based on probabilistic language modeling will be presented. Models are estimated for each document individually. The approach to modeling is non-parametric and integrates the entire retrieval process into a single model. One advantage of this approach is that collection statistics, which other probabilistic models use heuristically to assign concept probabilities, are used directly in the estimation of language model probabilities. The language modeling approach has been implemented and tested empirically, and it performs very well on standard test collections and query sets. To improve retrieval effectiveness, IR systems use additional techniques such as relevance feedback, unsupervised query expansion and structured queries. These and other techniques are discussed in terms of the language modeling approach, and empirical results are given for several of the techniques developed. These results provide further proof of concept for the use of language models for retrieval tasks.
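The idea of estimating a language model per document and using collection statistics directly in the probability estimate can be illustrated with a minimal sketch. The code below is an assumption-laden simplification and not the estimator used in the thesis: it ranks documents by the likelihood of the query under each document's unigram model, smoothed by linear interpolation with a collection-wide model. The function names, the whitespace tokenizer, and the interpolation weight `lam` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems stem and filter).
    return text.lower().split()


def rank_by_query_likelihood(query, docs, lam=0.5):
    """Rank documents by log P(query | document language model).

    Each document gets its own unigram model, which is linearly
    interpolated with a model estimated from the whole collection.
    `lam` is a hypothetical weight on the document model.
    Returns (document index, log-likelihood) pairs, best first.
    """
    doc_tokens = [tokenize(d) for d in docs]

    # Collection statistics: term counts over every document.
    collection = Counter()
    for toks in doc_tokens:
        collection.update(toks)
    collection_len = sum(collection.values())

    query_terms = tokenize(query)
    scores = []
    for idx, toks in enumerate(doc_tokens):
        tf = Counter(toks)
        doc_len = len(toks)
        log_p = 0.0
        for term in query_terms:
            p_doc = tf[term] / doc_len if doc_len else 0.0
            p_coll = collection[term] / collection_len if collection_len else 0.0
            p = lam * p_doc + (1 - lam) * p_coll
            if p == 0.0:
                # Term unseen in the whole collection: query has zero likelihood.
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores.append((idx, log_p))

    return sorted(scores, key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    docs = [
        "language models for retrieval",
        "bees swarm optimization",
        "retrieval of documents with language models and language statistics",
    ]
    for idx, score in rank_by_query_likelihood("language retrieval", docs):
        print(idx, round(score, 3))
```

Because the collection model contributes to every term probability, a document that lacks a query term is penalized rather than eliminated, which is one simple way collection statistics can enter the estimate directly instead of through a separate heuristic weight.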


Search diversification (also called diversity search) is an important approach to tackling the query ambiguity problem in information retrieval. It aims to diversify search results that are originally ranked by their probability of relevance to a given query, re-ranking them to cover as many different aspects (or subtopics) of the query as possible. Most existing diversity search models heuristically balance the relevance ranking and the diversity ranking, yet lack an efficient learning mechanism to reach an optimized parameter setting. To address this problem, we propose a learning-to-diversify approach that can directly optimize search diversification performance (in terms of any effectiveness metric). We first extend the ranking function of a widely used learning-to-rank framework, LambdaMART, so that the extended ranking function can correlate relevance and diversity indicators. Furthermore, we develop an effective learning algorithm, namely the Document Repulsion Model (DRM), to train the ranking function based on a Document Repulsion Theory (DRT). DRT assumes that two result documents covering similar query aspects (i.e., subtopics) should be mutually repulsive, for the purpose of search diversification. Accordingly, the proposed DRM exerts a repulsion force between each pair of similar documents in the learning process, and includes the diversity effectiveness metric to be optimized as part of the loss function. Although learning-based diversity search methods already exist, they often involve an iterative sequential selection process during ranking, which makes training computationally complex and time consuming, whereas our proposed learning strategy can largely reduce the time cost. Extensive experiments are conducted on the TREC diversity track data (2009, 2010 and 2011). The results demonstrate that our model significantly outperforms a number of baselines in terms of effectiveness and robustness. Further, an efficiency analysis shows that the proposed DRM has a lower computational complexity than the state-of-the-art learning-to-diversify methods.
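For context, the kind of heuristic relevance/diversity balance and iterative sequential selection that the abstract contrasts itself with can be sketched as a greedy re-ranker in the style of maximal marginal relevance (MMR). This is explicitly not the Document Repulsion Model; it is a well-known baseline technique shown under assumptions. The function name `mmr_rerank`, the hand-set `trade_off` parameter, and the precomputed similarity matrix are hypothetical inputs chosen for illustration.

```python
def mmr_rerank(relevance, similarity, trade_off=0.7, k=None):
    """Greedy MMR-style re-ranking (a baseline sketch, not the DRM).

    relevance:  list of relevance scores, one per document.
    similarity: square matrix, similarity[i][j] in [0, 1].
    trade_off:  hand-tuned weight on relevance versus redundancy.
    Returns document indices in their re-ranked order.
    """
    n = len(relevance)
    k = n if k is None else min(k, n)
    selected = []
    remaining = list(range(n))

    while remaining and len(selected) < k:

        def mmr_score(i):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        # One sequential selection step: pick the best remaining document.
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    relevance = [0.9, 0.85, 0.5]
    # Documents 0 and 1 are near-duplicates; document 2 covers another aspect.
    similarity = [
        [1.0, 0.95, 0.1],
        [0.95, 1.0, 0.1],
        [0.1, 0.1, 1.0],
    ]
    print(mmr_rerank(relevance, similarity, trade_off=0.5))
```

In this sketch `trade_off` has to be set by hand, and every selection step rescans the remaining candidates against the already selected ones, which illustrates the two limitations the abstract attributes to earlier methods.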


Articles published on Information Retrieval Problem

Bees swarm optimization guided by data mining techniques for document information retrieval

Expert Search Strategies: The Information Retrieval Practices of Healthcare Information Professionals.

A Language Modeling Approach to Information Retrieval

Improved sqrt-cosine similarity measurement

GPU-based exhaustive algorithms processing kNN queries

The Capacity of Private Information Retrieval

Learning to diversify web search results with a Document Repulsion Model

Disambiguating context-dependent polarity of words: An information retrieval approach

IRaPPA: information retrieval based integration of biophysical models for protein assembly selection.

Efficiently Mining High Quality Phrases from Texts

Informed Group-Sparse Representation for Singing Voice Separation

A System for Searching and Analyzing Reliable Information on the Internet

CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets

Cryptoleq: A Heterogeneous Abstract Machine for Encrypted and Unencrypted Computation

Near Duplicate Document Detection using Document Image

Style-based exploration of illustration datasets

Automatic and online setting of similarity thresholds in content-based visual information retrieval problems

Learning in Variable-Dimensional Spaces.

Query-guided maximum satisfiability

Extraction of Root Words using Morphological Analyzer for Devanagari Script
