Term-document Matrices Research Articles

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, ability to represent data in a compact form, and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that using the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while drastically reducing the time required for analysis. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool both for preprocessing data before applying computationally expensive algorithms and for directly extracting correlated patterns.
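The abstract does not spell out how a representative pattern is obtained, so the following is only an illustrative sketch, not the PROXIMUS implementation. It assumes a binary matrix stored as one set of column indices per row and alternates between two steps: given a pattern, select the rows that are closer to the pattern than to the empty row; given the selected rows, keep the columns present in at least half of them. The function name `rank_one_binary` and the choice of the first nonempty row as the starting pattern are assumptions made for this example.

```python
def rank_one_binary(rows, max_iter=20):
    """Approximate a binary matrix by one pattern and the rows that carry it.

    rows: list of sets; rows[i] holds the column indices equal to 1 in row i.
    Returns (presence, pattern): a list of booleans marking the selected rows,
    and the set of column indices forming the representative pattern.
    """
    # Assumption: start from the first nonempty row (an arbitrary choice).
    pattern = set(next((r for r in rows if r), set()))
    presence = []
    for _ in range(max_iter):
        # Fix the pattern. Selecting row r costs |r| + |pattern| - 2|r & pattern|
        # mismatches, leaving it out costs |r|, so select when 2|r & pattern| >= |pattern|.
        new_presence = [
            len(pattern) > 0 and 2 * len(r & pattern) >= len(pattern)
            for r in rows
        ]
        selected = sum(new_presence)
        # Fix the selected rows. Count how many of them contain each column.
        counts = {}
        for r, chosen in zip(rows, new_presence):
            if chosen:
                for j in r:
                    counts[j] = counts.get(j, 0) + 1
        # Keep a column when at least half of the selected rows contain it.
        new_pattern = {j for j, c in counts.items() if 2 * c >= selected}
        if new_presence == presence and new_pattern == pattern:
            break  # neither step changed anything
        presence, pattern = new_presence, new_pattern
    return presence, pattern


# Three rows share the columns {0, 1, 2}; two rows belong to a different group.
example_rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {5, 6}, {5, 6, 7}]
presence, pattern = rank_one_binary(example_rows)
```

In this toy input the first three rows are summarized by a single pattern, which is the kind of reduction the abstract describes; the remaining rows would need a further pass to be summarized in turn.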

We present four numerical methods for computing the singular value decomposition (SVD) of large sparse matrices on a multiprocessor architecture. We emphasize Lanczos and subspace iteration-based methods for determining several of the largest singular triplets (singular values and corresponding left- and right-singular vectors) for sparse matrices arising from two practical applications: information retrieval and seismic reflection tomography. The target architectures for our implementations are the CRAY-2S/4–128 and Alliant FX/80. The sparse SVD problem is well motivated by recent information-retrieval techniques in which dominant singular values and their corresponding singular vectors of large sparse term-document matrices are desired, and by nonlinear inverse problems from seismic tomography applications which require approximate pseudo-inverses of large sparse Jacobian matrices. This research may help advance the development of future out-of-core sparse SVD methods, which can be used, for example, to handle extremely large sparse matrices (O(10^6) rows or columns) associated with extremely large databases in query-based information-retrieval applications.
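As a rough illustration of what a singular triplet of a sparse term-document matrix is, the sketch below runs plain power iteration, which can be seen as subspace iteration with a one-dimensional subspace. It is not one of the four methods from the abstract and is not tuned for multiprocessors. The storage format (a dictionary mapping `(row, column)` to a nonzero value), the function names, and the uniform starting vector are assumptions made for this example; the uniform start is reasonable for nonnegative term counts but may fail for matrices whose dominant right-singular vector is orthogonal to it.

```python
import math


def matvec(entries, x, n_rows):
    """Compute A @ x for a sparse A given as {(i, j): value}."""
    y = [0.0] * n_rows
    for (i, j), value in entries.items():
        y[i] += value * x[j]
    return y


def rmatvec(entries, y, n_cols):
    """Compute A.T @ y for the same sparse format."""
    x = [0.0] * n_cols
    for (i, j), value in entries.items():
        x[j] += value * y[i]
    return x


def norm(vec):
    return math.sqrt(sum(t * t for t in vec))


def top_singular_triplet(entries, n_rows, n_cols, iters=500, tol=1e-12):
    """Estimate the largest singular value with its left and right vectors."""
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # assumed uniform starting vector
    u = [0.0] * n_rows
    sigma = 0.0
    for _ in range(iters):
        u = matvec(entries, v, n_rows)
        scale = norm(u)
        if scale == 0.0:
            return 0.0, u, v  # start vector lies in the null space of A
        u = [t / scale for t in u]
        w = rmatvec(entries, u, n_cols)
        new_sigma = norm(w)
        v = [t / new_sigma for t in w]
        converged = abs(new_sigma - sigma) <= tol * new_sigma
        sigma = new_sigma
        if converged:
            break
    return sigma, u, v


# Toy 3-term by 2-document matrix: only the nonzero counts are stored.
counts = {(0, 0): 2.0, (1, 0): 1.0, (1, 1): 1.0, (2, 1): 3.0}
sigma, u, v = top_singular_triplet(counts, n_rows=3, n_cols=2)
```

Each iteration touches only the stored nonzeros, which is the property that makes iterative methods attractive for large sparse matrices; Lanczos-type methods build on the same two matrix-vector products but typically converge faster when several triplets are wanted.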

Articles published on Term-document Matrices

Human Rights Texts: Converting Human Rights Primary Source Documents into Data

Text mining techniques for the translation of personality questionnaires in cross-cultural research

Maxent: An R Package for Low-memory Multinomial Logistic Regression with Support for Semi-automated Text Classification

Updating the partial singular value decomposition in latent semantic indexing

Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets

Matrices with Low-Rank-Plus-Shift Structure: Partial SVD and Latent Semantic Indexing

Generating hierarchical document indices from common denominators in large document collections

COMPUTING EXTREMAL SINGULAR TRIPLETS OF SPARSE MATRICES ON A SHARED-MEMORY MULTIPROCESSOR

Large-Scale Sparse Singular Value Computations

Transitive reduction of a rectangular boolean matrix
