Indexing Vocabulary Research Articles

In recent years, research has increasingly focused on the Bag-of-Words based near-duplicate visual search paradigm with inverted indexing. One fundamental yet unexplored challenge is how to maintain the large indexing structures within a single server subject to its memory constraint, which makes it extremely hard to scale up to millions or even billions of images. In this paper, we propose to parallelize the near-duplicate visual search architecture to index millions of images over multiple servers, including the distribution of both the visual vocabulary and the corresponding indexing structure. We optimize the distribution of vocabulary indexing from a machine learning perspective, which provides a “memory light” search paradigm that leverages the computational power across multiple servers to reduce the search latency. In particular, our solution addresses two essential issues: “What to distribute” and “How to distribute”. “What to distribute” is addressed by a “lossy” vocabulary Boosting, which discards both frequent and non-discriminative words prior to distribution. “How to distribute” is addressed by learning an optimal distribution function, which maximizes the uniformity of assigning the words of a given query to multiple servers. We validate the distributed vocabulary indexing scheme in a real-world location search system over 10 million landmark images. Compared with the state-of-the-art alternatives of single-server search [5], [6], [16] and distributed search [23], our scheme yields a speedup of about 200% at comparable precision while distributing only 5% of the words. We also report excellent robustness even when some of the servers crash.
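
The two questions in the abstract above can be illustrated with a small sketch. It is not the method of the paper: the “lossy” Boosting step is replaced here by simply dropping the most frequent words, and the learned distribution function is replaced by a greedy load-balancing heuristic. The function names, the drop_fraction parameter and the sample frequencies are all hypothetical.

```python
from collections import defaultdict
import heapq


def prune_vocabulary(doc_freq, drop_fraction):
    """'What to distribute' (stand-in): drop the most frequent words.

    doc_freq maps a visual word to its posting-list length.
    The paper uses a learned, lossy Boosting step instead.
    """
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    n_drop = int(len(ranked) * drop_fraction)
    return {w: doc_freq[w] for w in ranked[n_drop:]}


def assign_words(doc_freq, n_servers):
    """'How to distribute' (stand-in): greedy load balancing.

    Words are visited from heaviest to lightest, and each one is
    placed on the server with the smallest load so far.
    """
    heap = [(0, s) for s in range(n_servers)]  # (current load, server id)
    heapq.heapify(heap)
    assignment = {}
    for word in sorted(doc_freq, key=lambda w: (-doc_freq[w], w)):
        load, server = heapq.heappop(heap)
        assignment[word] = server
        heapq.heappush(heap, (load + doc_freq[word], server))
    return assignment


def route_query(query_words, assignment):
    """Group a query's words by the server holding their posting lists."""
    per_server = defaultdict(list)
    for word in query_words:
        if word in assignment:  # pruned words are skipped
            per_server[assignment[word]].append(word)
    return dict(per_server)


# Hypothetical posting-list lengths for a ten-word vocabulary.
sample_freq = {"w0": 900, "w1": 40, "w2": 35, "w3": 30, "w4": 25,
               "w5": 20, "w6": 15, "w7": 10, "w8": 5, "w9": 1}
kept = prune_vocabulary(sample_freq, drop_fraction=0.1)
placement = assign_words(kept, n_servers=3)
print(route_query(["w0", "w1", "w4", "w5"], placement))
```

In this sketch each server would look up only its own share of the query words, so the per-server work shrinks as the words spread more evenly, which is the intuition behind maximizing the uniformity of the assignment.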

In this thesis we investigate the possibility of integrating domain-specific knowledge into biomedical information retrieval (IR). Recent decades have seen fast-growing interest in biomedical research, reflected by exponential growth in the scientific literature. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge is, however, far from trivial. The first research theme investigates heuristics for obtaining word-based representations from biomedical text for robust retrieval. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. Document preprocessing heuristics such as stop word removal, stemming, and breakpoint identification and normalization were shown to strongly affect retrieval performance. An effective combination of heuristics was identified to obtain a word-based representation from text for the remainder of this thesis. The second research theme deals with concept-based retrieval. We compared a word-based representation with a concept-based one and determined to what extent a manual concept-based representation can be obtained automatically from text. Retrieval based on concepts alone was demonstrated to be significantly less effective than word-based retrieval. This lower performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations significantly improved on word-only retrieval. In the third and last research theme we propose a cross-lingual framework for monolingual biomedical IR. In this framework, the integration of a concept-based representation is viewed as a cross-lingual matching problem involving a word-based and a concept-based representation language. This framework gives us the opportunity to adopt a large set of established cross-lingual information retrieval (CLIR) methods and techniques for this domain. Experiments with basic term-to-term translation models demonstrate that this approach can significantly improve word-based retrieval. Directions for future work are using these concepts for communication between user and retrieval system, extending the translation models and extending CLIR-enhanced concept-based retrieval outside the biomedical domain. Available online from http://purl.utwente.nl/publications/72481.

Articles published on Indexing Vocabulary

Passport: Improving Automated Formal Verification Using Identifiers

Methods for Assessing the Psychological Tension of Social Network Users during the Coronavirus Pandemic and Its Uses for Predictive Analysis

Vocabulary Index as a Sustainable Resource for Teaching Extended Writing in the Post-Pandemic Era

The mutability of fiction descriptors: the evolution of ‘pulp’

Deep learning based high similarity automatic retrieval algorithm for vocabulary interpretation of workers of Food Sector in China

Analysis of AI MT Based on Fuzzy Algorithm

Toponymic Legends of Ural Amateur Miners

Repercussion of the implementation of the Picture Exchange Communication System - PECS in the overload index of mothers of children with Autism Spectrum Disorder.

Developing A Model for Predicting the Speech Intelligibility of South Korean Children with Cochlear Implantation using a Random Forest Algorithm

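Relevance and Application of Subject Authority in the Retrieval System for the Kitab Kuning Collection at the Central Library of UIN Maulana Malik Ibrahim Malang [Relevansi Dan Penerapan Subject Authority dalam Sistem Temu Kembali Koleksi Kitab Kuning Pusat Perpustakaan UIN Maulana Malik Ibrahim Malang]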

Pan-granularism and specificity

Meshable: searching PubMed abstracts by utilizing MeSH and MeSH-derived topical terms.

Knowledge Organisation and its Role in Multimedia Information Retrieval

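Relationship between receptive and expressive vocabulary in children with specific developmental speech and language disorder [Relação entre vocabulário receptivo e expressivo em crianças com transtorno específico do desenvolvimento da fala e da linguagem]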

Can Indexing Be Automated? The Example of the Deutsche Nationalbibliothek

Task-Oriented Creative Writing with Système-D

Learning to Distribute Vocabulary Indexing for Scalable Visual Search

Automatic indexing by discipline and high-level categories: Methodology and potential applications.

Controlled vocabularies and tags: An analysis of research methods

Proof of concept
