Abstract

Recently, Latent Semantic Indexing (LSI) based on Singular Value Decomposition (SVD) has been proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, although its representative quality has been validated, it is often criticized for having low discriminative power when representing documents. In this paper, SVD on clusters is proposed to improve the discriminative power of LSI. The contribution of this paper is threefold. First, we survey existing linear algebra methods for LSI, including both SVD based and non-SVD based methods. Second, we propose SVD on clusters for LSI and explain theoretically that it involves two manipulations: dimension expansion of document vectors and dimension projection using SVD. Moreover, we develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Third, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performance of the proposed methods. Experiments demonstrate that, to some extent, SVD on clusters can improve the precision of the interdocument similarity measure in comparison with other SVD based LSI methods.
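The idea of applying SVD per cluster can be sketched roughly as follows. This is only one possible reading of the abstract, not the paper's actual procedure: the cluster labels are assumed to be given in advance, and the helper names `rank_k_approx` and `svd_on_clusters`, the toy matrix, and the rank `k` are all hypothetical choices made for illustration.

```python
import numpy as np

def rank_k_approx(M, k):
    """Best rank-k approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = min(k, len(s))
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def svd_on_clusters(A, labels, k):
    """Approximate each cluster's columns separately (assumed reading).

    A      : term-by-document matrix (rows = terms, columns = documents)
    labels : cluster id per document, assumed to come from any clustering step
    k      : number of singular values kept per cluster
    """
    labels = np.asarray(labels)
    A_hat = np.zeros_like(A, dtype=float)
    for c in np.unique(labels):
        cols = np.flatnonzero(labels == c)
        # Truncated SVD is applied to this cluster's documents only.
        A_hat[:, cols] = rank_k_approx(A[:, cols], k)
    return A_hat

# Toy term-by-document matrix with two obvious document clusters.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
], dtype=float)
labels = [0, 0, 1, 1]

A_hat = svd_on_clusters(A, labels, k=1)
```

In this sketch each cluster keeps its own latent dimensions, so documents from different clusters are not forced into a single shared low-rank space, which is one plausible way to retain discriminative power.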

Highlights

  • As computer networks become the backbones of science and economy, enormous quantities of machine readable documents become available

  • This paper proposes singular value decomposition (SVD) on clusters as a new indexing method for Latent Semantic Indexing

  • Based on a review of current trends in linear algebraic methods for Latent Semantic Indexing (LSI), we claim that the state of the art of LSI roughly follows two directions: SVD based LSI methods and non-SVD based LSI methods


Summary

Introduction

As computer networks become the backbones of science and economy, enormous quantities of machine readable documents become available. The usual logic-based programming paradigm has great difficulty capturing the fuzzy and often ambiguous relations in text documents. For this reason, text mining, also known as knowledge discovery from texts, has been proposed to deal with the uncertainty and fuzziness of language and to disclose hidden patterns (knowledge) in documents. Traditionally, information is retrieved by literally matching the terms in documents with those of a query. Such lexical matching can be inaccurate when it is used to match a user’s query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user’s query may not match those of a relevant document.
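The synonymy problem described above can be illustrated with a minimal sketch. The vocabulary, the three toy documents, and the choice of k = 2 latent dimensions are assumptions made purely for illustration and do not come from the paper. The query "car" shares no literal term with the document about "automobile", so lexical matching scores it zero, whereas projecting both into a truncated SVD space relates them through the shared term "engine".

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Columns are documents; entries are raw term counts.
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
    [0, 0, 1],  # petal
], dtype=float)

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is zero."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0 or nv == 0 else float(u @ v / (nu * nv))

query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single term "car"

# Lexical matching: compare the query with raw document vectors.
lexical = [cosine(query, A[:, j]) for j in range(A.shape[1])]

# LSI: project query and documents onto the top-k left singular vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]
query_k = Uk.T @ query   # query in the k-dimensional latent space
docs_k = Uk.T @ A        # each column is a document in latent space
latent = [cosine(query_k, docs_k[:, j]) for j in range(A.shape[1])]

print("lexical:", np.round(lexical, 3))
print("latent: ", np.round(latent, 3))
```

In this toy case the second document (automobile, engine) has zero lexical similarity to the query but a high latent similarity, while the unrelated third document stays near zero in both.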
