Using SVD on Clusters to Improve Precision of Interdocument Similarity Measure.

Wen Zhang, Fan Xiao, Siguang Zhang, Bin Li

doi:10.1155/2016/1096271

Abstract

Recently, LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) has been proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, it is often criticized for its low discriminative power in representing documents, although it has been validated as having good representative quality. In this paper, SVD on clusters is proposed to improve the discriminative power of LSI. The contribution of this paper is threefold. First, we survey existing linear algebra methods for LSI, including both SVD based and non-SVD based methods. Second, we propose SVD on clusters for LSI and explain theoretically that dimension expansion of document vectors and dimension projection using SVD are the two manipulations involved in SVD on clusters. Moreover, we develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Third, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performance of the proposed methods. Experiments demonstrate that, to some extent, SVD on clusters can improve the precision of the interdocument similarity measure in comparison with other SVD based LSI methods.

Highlights

As computer networks become the backbones of science and economy, enormous quantities of machine readable documents become available
This paper proposes singular value decomposition (SVD) on clusters as a new indexing method for Latent Semantic Indexing
Based on a review of current trends in linear algebraic methods for Latent Semantic Indexing (LSI), we claim that the state of the art of LSI roughly falls into two categories: SVD based LSI methods and non-SVD based LSI methods

Summary

Introduction

As computer networks become the backbones of science and the economy, enormous quantities of machine-readable documents become available. The usual logic-based programming paradigm has great difficulty capturing the fuzzy and often ambiguous relations in text documents. For this reason, text mining, also known as knowledge discovery from texts, has been proposed to deal with the uncertainty and fuzziness of language and to disclose hidden patterns (knowledge) in documents. Traditionally, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user’s query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user’s query may not match those of a relevant document.
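
The synonymy problem described above can be illustrated with a minimal sketch. It is not the SVD on clusters method of this paper; it is plain LSI with a rank-k truncated SVD, using the standard query fold-in q_hat = q^T U_k S_k^{-1}. The vocabulary, the three toy documents, the query, and the choice k = 2 are all assumptions made up for illustration.

```python
import numpy as np

# Toy term-by-document matrix (rows: terms, columns: documents).
# The vocabulary and documents are invented for illustration only.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
    [0, 0, 1],  # petal
], dtype=float)

# Hypothetical query containing only the term "automobile".
q = np.array([0, 1, 0, 0, 0], dtype=float)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0


# Lexical matching: compare the query with each document column directly.
lexical = [cosine(q, A[:, j]) for j in range(A.shape[1])]

# Plain LSI: rank-k truncated SVD, A ~ U_k S_k V_k^T (k = 2 is an assumption).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Fold the query into the latent space: q_hat = q^T U_k S_k^{-1}.
q_hat = (q @ U_k) / s_k

# Documents are represented by the rows of V_k.
latent = [cosine(q_hat, V_k[j]) for j in range(A.shape[1])]

print("lexical:", np.round(lexical, 3))
print("latent: ", np.round(latent, 3))
```

In this toy case the first document ("car engine") shares no term with the query, so its lexical similarity is zero. In the latent space it lies close to the query, because "car" and "automobile" both co-occur with "engine". The unrelated third document stays dissimilar under both measures.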

Methods

Results

Conclusion