Summarizing large text collection using topic modeling and clustering based on MapReduce framework

N K Nagwani

doi:10.1186/s40537-015-0020-5

Abstract

Document summarization provides an instrument for faster understanding of a collection of text documents and has a number of real-life applications. Semantic similarity and clustering can be used efficiently to generate effective summaries of large text collections. Summarizing a large volume of text is a challenging and time-consuming problem, particularly when semantic similarity computation is part of the summarization process. Summarizing a text collection involves intensive text processing and computation. MapReduce is a proven state-of-the-art technology for handling Big Data. In this paper, a novel framework based on MapReduce technology is proposed for summarizing large text collections. The proposed technique combines semantic similarity based clustering with topic modeling using Latent Dirichlet Allocation (LDA) to summarize large text collections over the MapReduce framework. The summarization task is performed in four stages and provides a modular implementation of multi-document summarization. The presented technique is evaluated in terms of scalability, and several text summarization parameters, namely compression ratio, retention ratio, ROUGE and Pyramid score, are also measured. The advantages of the MapReduce framework are clearly visible from the experiments, which also demonstrate that MapReduce provides a faster implementation of summarizing large text collections and is a powerful tool in Big Text Data analysis.
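For readers unfamiliar with the programming model the abstract relies on, the following is a minimal in-memory sketch of the map, shuffle and reduce phases, applied to counting terms across a small document collection. It is an illustration only, not the paper's four-stage pipeline: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the whitespace tokenizer are assumptions made for this example, and a real deployment would run on a distributed framework such as Hadoop rather than in a single process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit a (term, 1) pair for every whitespace-separated token (hypothetical mapper)."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one term (hypothetical reducer)."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce sequentially over a dict of {doc_id: text}."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = {"d1": "big data", "d2": "big text data"}
    print(run_job(docs))
```

Because each mapper call touches only one document and each reducer call touches only one key, both phases can be spread across machines, which is the property the scalability evaluation in the abstract depends on.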

Highlights

Text summarization is one of the important and challenging problems in text mining
Text summarization is a function that converts a large body of text into a smaller one such that the smaller text carries the overall picture of the large text collection, as given in equation (1), where D represents the large text collection, d represents the summarized text document, and the size of D is larger than the size of d
In single-document summarization, a single large text document is condensed into a single summary document, whereas in multi-document summarization, a set of text documents is condensed into a single summary document that gives an overall glimpse of the multiple documents
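
Equation (1) itself is not reproduced in this excerpt. One way to write the relation the highlight describes is sketched below; the function name f and the size notation |·| are assumptions made here and may differ from the paper's own notation.

```latex
f : D \rightarrow d, \qquad |d| < |D|
```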

Summary

Introduction

Text summarization is one of the important and challenging problems in text mining. It provides a number of benefits to users, and a number of fruitful real-life applications can be developed using it. A number of summarization techniques have been proposed that generate summaries by extracting the important sentences from a given collection of documents. A MapReduce framework based summarization method is proposed to generate summaries from large text collections. Based on the methodology discussed in the previous section, the algorithm for the proposed multi-document summarization using a semantic similarity based clustering technique is presented. In this work, the average precision, recall and F-measure scores generated by ROUGE-1, ROUGE-2 and ROUGE-L are used to measure the performance of the summaries and to compare the presented algorithm over the MapReduce framework. The pyramid score P is the ratio of D to Max. Because P compares the actual distribution of summary content units (SCUs) to an empirically determined weighting, it provides a direct correlate of the way human summarizers select information from source texts.
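
The two evaluation measures named above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the evaluation code used in the paper: `rouge_n` uses lowercased whitespace tokens and a single reference summary, and `pyramid_score` follows the commonly used definition in which D is the total weight of the SCUs found in the summary and Max is the largest total weight any summary with the same number of SCUs could reach. The excerpt does not define D and Max, so that reading is an assumption, and all function and parameter names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F-measure) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count per n-gram
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def pyramid_score(observed_weights, pyramid_weights):
    """P = D / Max, assuming D is the summed weight of the SCUs in the summary
    and Max is the best total achievable with the same number of SCUs."""
    d = sum(observed_weights)
    k = len(observed_weights)
    max_score = sum(sorted(pyramid_weights, reverse=True)[:k])
    return d / max_score if max_score else 0.0


if __name__ == "__main__":
    print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
    print(pyramid_score([3, 2, 1], [3, 3, 2, 2, 1, 1]))
```

In the usage lines, the candidate shares five of its six unigrams with the reference, and the summary's three SCUs carry a total weight of 6 against a best possible 8 for three SCUs.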

Result analysis

Conclusions and future enhancements

Journal: Journal of Big Data	Publication Date: Jun 26, 2015
Citations: 99	License type: CC BY 2.0

Similar Papers

Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling
Nabil Alami ... Ouafae Ammor
Expert Systems with Applications | VOL. 172
02 Feb 2021

A Brief Note on DocumentSummarization
01 Aug 2020

Models, Inference, and Implementation for Scalable Probabilistic Models of Text
01 Jan 2014

BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.
Gizem Soğancıoğlu ... Hakime Öztürk
Bioinformatics | VOL. 33
12 Jul 2017
