Large-scale information retrieval with latent semantic indexing

Todd A Letsche,Michael W Berry

doi:10.1016/s0020-0255(97)00044-3

Abstract

As the amount of electronic information increases, traditional lexical (or Boolean) information retrieval techniques will become less useful. Large, heterogeneous collections will be difficult to search since the sheer volume of unranked documents returned in response to a query will overwhelm the user. Vector-space approaches to information retrieval, on the other hand, allow the user to search for concepts rather than specific words, and rank the results of the search according to their relative similarity to the query. One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a reduced-rank model of the term-document space. However, the original implementation of LSI lacked the execution efficiency required to make LSI useful for large data sets. A new implementation of LSI, LSI++, seeks to make LSI efficient, extensible, portable, and maintainable. The LSI++ Application Programming Interface (API) allows applications to immediately use LSI without knowing the implementation details of the underlying system. LSI++ supports both serial and distributed searching of large data sets, providing the same programming interface regardless of the implementation actually executing. In addition, a World Wide Web interface was created to allow simple, intuitive searching of document collections using LSI++. Timing results indicate that the serial implementation of LSI++ searches up to six times faster than the original implementation of LSI, while the parallel implementation searches nearly 180 times faster on large document collections.

Full Text