CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

E Laxmi Lydia, Vijayakumar Varadarajan, G Jose Moses, Eswaran Perumal, Fredi Nonyelu, K Shankar, Andino Maseleno

doi:10.22452/mjcs.sp2020no1.8

Open Access

https://doi.org/10.22452/mjcs.sp2020no1.8

Abstract

Big data is a challenging field in data processing since the information is retrieved from various search engines through the internet. A number of large organizations that use document clustering fail to arrange the documents sequentially in their machines. Across the globe, advanced technology has contributed to high-speed internet access. But the useful yet unorganized information in machine files causes confusion in the retrieval process. Manual ordering of files has its own complications. In this paper, application software such as Apache Lucene and Hadoop takes the lead in text mining for indexing and parallel implementation of document clustering. In organizations, it identifies the structure of the text data in computer files and its arrangement from files to folders, folders to subfolders, and to higher folders. A deeper analysis of document clustering was performed by considering various efficient algorithms such as LSI and SVD, which were compared with the newly proposed updated model of Non-Negative Matrix Factorization. The parallel implementation on Hadoop developed automatic clusters for similar documents. The MapReduce framework enforced its approach using the K-means algorithm for all the incoming documents. The final clusters were automatically organized in folders using Apache Lucene in machines. This model was tested on the Newsgroup20 dataset of text documents. Thus this paper presents an implementation for large-scale document collections using the parallel performance of MapReduce and Lucene to generate an automatic arrangement of documents, which reduces the computational time and speeds up the retrieval of documents in any scenario.
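
The K-means step that the abstract assigns to MapReduce can be illustrated with a minimal single-process sketch. This is not the paper's code: it only imitates the map, shuffle and reduce phases in plain Python, it uses raw term counts as features where the paper uses factorization-based features, and all function names (`tf_vectors`, `map_assign`, `reduce_update`, `kmeans_mapreduce`) are hypothetical.

```python
from collections import defaultdict


def tf_vectors(docs):
    """Turn raw texts into term-count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1.0
        vectors.append(v)
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(vector, centroids):
    """Map phase: emit (id of nearest centroid, (vector, 1))."""
    cid = min(range(len(centroids)), key=lambda k: sq_dist(vector, centroids[k]))
    return cid, (vector, 1)


def reduce_update(values):
    """Reduce phase: average all vectors that share one centroid id."""
    total, count = None, 0
    for vec, n in values:
        total = list(vec) if total is None else [t + x for t, x in zip(total, vec)]
        count += n
    return [t / count for t in total]


def kmeans_mapreduce(vectors, centroids, iterations=10):
    """Repeat map -> shuffle -> reduce; return final labels and centroids."""
    for _ in range(iterations):
        groups = defaultdict(list)          # shuffle: group values by key
        for v in vectors:
            cid, value = map_assign(v, centroids)
            groups[cid].append(value)
        # An empty cluster keeps its previous centroid.
        centroids = [
            reduce_update(groups[k]) if k in groups else centroids[k]
            for k in range(len(centroids))
        ]
    labels = [map_assign(v, centroids)[0] for v in vectors]
    return labels, centroids


# Toy usage: two topics, initial centroids taken from documents 0 and 2.
docs = [
    "hadoop mapreduce cluster",
    "mapreduce hadoop job",
    "football match goal",
    "goal football team",
]
vocab, vectors = tf_vectors(docs)
labels, centroids = kmeans_mapreduce(vectors, [vectors[0], vectors[2]])
print(labels)
```

On a real Hadoop cluster the map and reduce functions would run on separate nodes and the grouping would be done by the framework's shuffle; here a dictionary stands in for that step.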

Journal: Malaysian Journal of Computer Science
Publication Date: Nov 27, 2020
Citations: 5