Reducing Dimensionality in Text Mining using Conjugate Gradients and Hybrid Cholesky Decomposition

Jasem M

doi:10.14569/ijacsa.2017.080716

Abstract

Data mining on large datasets is limited in its ability to identify the records relevant to a given query. The limitations include a lack of interaction in the required objective space; an inability to handle datasets with discrete variables, especially when values are missing; an inability to classify records according to the query; and poor generation of explicit knowledge for a query, which increases the dimensionality of the data. This paper addresses the problem of increasing data dimensionality using a modified non-negative matrix factorization (NMF). The additional dimensionality that arises from the non-orthogonality of NMF is resolved with Cholesky decomposition (cdNMF). First, the datasets are structured to form a well-defined geometric structure. Next, the complex conjugate values are extracted and the conjugate gradient algorithm is applied to reduce the sparse matrix from the data vector. cdNMF is then used to extract the feature vector from the dataset, and the data vector is linearly mapped from the upper triangular matrix obtained from the Cholesky decomposition. The method is evaluated using accuracy and normalized mutual information (NMI) on three text databases of varied patterns. The results indicate that the proposed technique handles larger instances better when retrieving documents for a query than NMF, neighborhood preserving non-negative matrix factorization (NPNMF), multiple manifolds non-negative matrix factorization (MMNMF), robust non-negative matrix factorization (RNMF), graph regularized non-negative matrix factorization (GNMF) and hierarchical non-negative matrix factorization (HNMF).
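To make the role of the Cholesky factor concrete, the following is a minimal sketch of one standard way a Cholesky decomposition can remove non-orthogonality from a factor matrix. It is an illustration under stated assumptions, not the paper's cdNMF procedure: the abstract does not give the exact steps, so the construction used here (factor the Gram matrix of a basis W as R^T R, orthogonalize with Q = W R^{-1}, and map the coefficients through the upper triangular R) is an assumed interpretation. The matrices `W` and `H` and the helper names `cholesky_upper` and `solve_right_upper` are hypothetical and chosen only for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def cholesky_upper(g):
    """Return upper triangular R with R^T R = g (g symmetric positive definite)."""
    n = len(g)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(g[i][i] - s)
            else:
                low[i][j] = (g[i][j] - s) / low[j][j]
    return transpose(low)


def solve_right_upper(w, r):
    """Return Q = W R^{-1} by substitution, one row of W at a time."""
    k = len(r)
    q = []
    for row in w:
        out = [0.0] * k
        for j in range(k):
            s = sum(out[i] * r[i][j] for i in range(j))
            out[j] = (row[j] - s) / r[j][j]
        q.append(out)
    return q


# Hypothetical non-negative factors: 3 terms x 2 topics, 2 topics x 3 documents.
W = [[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]]
H = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]

G = matmul(transpose(W), W)   # Gram matrix of the (non-orthogonal) basis
R = cholesky_upper(G)         # upper triangular Cholesky factor
Q = solve_right_upper(W, R)   # basis with orthonormal columns
H_mapped = matmul(R, H)       # coefficients linearly mapped through R
```

Under these assumptions the product is unchanged, since Q times H_mapped equals W times H, while the columns of Q are orthonormal. Note that Q in general does not remain non-negative, which is one reason the paper's actual construction may differ from this sketch.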

Highlights

Computing applications in many fields generate large volumes of data over many instances
Other problems associated with larger data instances include improper association or interaction in the feature space, an inability to handle large datasets with discrete variables, an inability to classify the data, poor knowledge generation for a given query, and poor computation due to missing variables
The cdNMF system is compared with conventional algorithms, including non-negative matrix factorization (NMF) [51], graph regularized non-negative matrix factorization (GNMF) [5], neighborhood preserving non-negative matrix factorization (NPNMF) [6], multiple manifolds non-negative matrix factorization (MMNMF) [7] and robust non-negative matrix factorization (RNMF) [8]
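The abstract states that this comparison is scored with accuracy and normalized mutual information (NMI). As a rough illustration of the second metric, the sketch below computes NMI between a reference labelling and a predicted clustering. Several normalizations of mutual information are in use; this example assumes normalization by the arithmetic mean of the two entropies, which may not be the variant used in the paper. The function name `nmi` and the label lists are hypothetical.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same items.

    Assumption: mutual information is divided by the arithmetic mean of the
    two entropies; other normalizations (geometric mean, max) also exist.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information summed over the non-empty cells of the contingency table.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in count_joint.items()
    )

    mean_entropy = (entropy(count_true) + entropy(count_pred)) / 2
    # Both labelings put every item in one cluster: treat as perfect agreement.
    return 1.0 if mean_entropy == 0 else mi / mean_entropy


# Hypothetical labels for four documents.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # clusters match up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # clusters carry no shared information
```

Because NMI depends only on how items are grouped, not on the cluster names, it suits comparing factorization-based clusterings whose cluster indices are arbitrary.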

Summary

Introduction

Computing applications in many fields generate large volumes of data over many instances. Large datasets with numerous instances pose severe challenges that lead to improper processing of such huge data volumes. Removing improper values from the datasets has a significant impact and improves the performance of processing large data [2]; however, the improved mining approach is not useful in some cases [3]. In spite of many efforts to deal with such instances, data mining algorithms still face severe challenges because they do not apply well to datasets with large instances. Other problems associated with larger data instances include improper association or interaction in the feature space, an inability to handle large datasets with discrete variables, an inability to classify the data, poor knowledge generation for a given query, and poor computation due to missing variables.

Objectives

Results

Conclusion