TOPNMF: Topic based Document Clustering using Non-negative Matrix Factorization

R Parimala, K Gomathi

doi:10.17485/ijst/v14i31.1293

Abstract

Objectives: This work focuses on creating targeted, content-specific, topic-based clusters. Such clusters help users discover the topics in a set of documents more efficiently. Methods/Statistical analysis: Non-negative Matrix Factorization (NMF) based models learn topics by directly decomposing the term-document matrix, a bag-of-words matrix representation of a text corpus, into two low-rank factor matrices: the Word-Topic feature Matrix (WTOM) and the Document-Topic feature Matrix (DTOM). Topic clusters and document clusters are extracted from the resulting feature matrices. The method does not rely on any assumed statistical distribution or probabilistic model. Experiments were carried out on a subset of the BBC Sport corpus. Findings: The experimental results indicate that the TOPNMF clusters achieved an accuracy of 100 percent. Novelty/Applications: NMF often fails to improve a given clustering result because the number of parameters increases linearly with the size of the corpus. TOPNMF has lower computational complexity than an exact decomposition such as Singular Value Decomposition (SVD).

Keywords: Topic cluster; Document cluster; Non-negative matrix factorization; K-means clustering; Word cloud
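The factorization described in the abstract can be illustrated with a minimal sketch. It assumes NumPy and uses Lee-Seung multiplicative updates as the solver; the abstract does not state which solver TOPNMF uses, so this is a stand-in, not the authors' implementation. The toy term-document matrix, the four-word vocabulary, and the function name `nmf_multiplicative` are hypothetical and chosen only for illustration.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (terms x docs) as W @ H.

    W (terms x k) plays the role of a word-topic matrix and
    H (k x docs) the role of a topic-document matrix.
    Solver: Lee-Seung multiplicative updates (an assumption here).
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = V.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["goal", "striker", "wicket", "bowler"]
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
], dtype=float)

W, H = nmf_multiplicative(V, k=2)

# Document clusters: assign each document to its strongest topic.
doc_clusters = H.argmax(axis=0)

# Topic descriptors: the highest-weighted term in each topic column.
top_term_per_topic = [terms[i] for i in W.argmax(axis=0)]

print("document clusters:", doc_clusters)
print("top term per topic:", top_term_per_topic)
```

Reading clusters off the largest entry per document is one simple choice; the keywords also mention K-means, which could instead be run on the columns of `H`.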

Highlights

Online activities are generating an outsized volume of unstructured text in emails, blog posts, social media posts, online reviews, news articles, etc.
There are a variety of commonly used topic modeling algorithms, including Singular Value Decomposition (SVD), Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and Structural Topic Model (STM)
SVD decomposes a structured representation of unstructured text into three matrices: the orthogonal left singular matrix, which represents the relationship between words and latent topics; a diagonal matrix, which describes the strength of each latent topic; and the right singular matrix, which indicates the similarity between documents and latent topics

Summary

Introduction

Online activities are generating an outsized volume of unstructured text in emails, blog posts, social media posts, online reviews, news articles, etc. Grouping massive amounts of text by hand is time-consuming and prone to mistakes and inconsistencies. Topic clustering is an unsupervised learning problem that finds unknown groups of similar data. Topic modeling attempts to discover and annotate the thematic structure in a collection of documents (1). There are a variety of commonly used topic modeling algorithms, including SVD, NMF, Latent Dirichlet Allocation (LDA), and Structural Topic Model (STM). SVD decomposes a structured representation of unstructured text into three matrices: the orthogonal left singular matrix, which represents the relationship between words and latent topics; a diagonal matrix, which describes the strength of each latent topic; and the right singular matrix, which indicates the similarity between documents and latent topics.
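The three SVD factors described above can be inspected on a small example. This is a minimal sketch, assuming NumPy; the toy term-document matrix (rows are terms, columns are documents) is hypothetical and not taken from the paper's corpus.

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms x 4 documents.
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
], dtype=float)

# U:  left singular vectors (terms x latent topics), orthonormal columns
# s:  singular values (strength of each latent topic), sorted descending
# Vt: right singular vectors, transposed (latent topics x documents)
U, s, Vt = np.linalg.svd(V, full_matrices=False)

# Keep only the k strongest latent topics for a low-rank approximation.
k = 2
V_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-2 approximation error:", np.round(np.linalg.norm(V - V_k), 3))
```

Unlike the NMF factors, the entries of `U` and `Vt` may be negative, which makes the SVD factors harder to read directly as word or document weights.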

Objectives

Methods

Results

Conclusion