A Framework for Generating Extractive Summary from Multiple Malayalam Documents

K Manju, S David Peter, Sumam Idicula

doi:10.3390/info12010041

Open Access

https://doi.org/10.3390/info12010041

Journal: Information	Publication Date: Jan 18, 2021
Citations: 10	License type: CC BY 4.0

Affiliation: Cochin University of Science and Technology

Abstract

Automatic extractive text summarization selects a subset of sentences that represents the most notable content of a document. In an era of rapidly growing digital data, most of it unstructured text, users need to understand large amounts of text in a short time, which creates the need for an automatic text summarizer. From a summary, users get an idea of the entire content of a document and can decide whether to read it in full. This work focuses on generating a summary from multiple news documents. In this setting, the summary helps reduce the redundant news reported by different newspapers. Multi-document summarization is more challenging than single-document summarization because it has to handle overlapping information among sentences from different documents. Extractive text summarization yields the most significant part of the document by discarding irrelevant and redundant sentences. In this paper, we propose a framework for extracting a summary from multiple documents in the Malayalam language. Because multi-document summarization data sets are scarce, methods based on deep learning are difficult to apply. The proposed work discusses the performance of existing standard algorithms in multi-document summarization of the Malayalam language. We propose a sentence extraction algorithm that selects the top-ranked sentences with maximum diversity. The system is found to perform well in terms of precision, recall, and F-measure on multiple input documents.
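To make the evaluation terms concrete, the following is a minimal sketch of precision, recall, and F-measure for an extractive summary. The abstract does not state which units are compared, so this sketch assumes sentence indices checked against a reference extract; the function name `precision_recall_f1` and the example indices are hypothetical.

```python
def precision_recall_f1(selected, reference):
    """Compare system-selected sentence indices with a reference extract.

    Assumption: evaluation is at the sentence level. The paper may use a
    different unit (for example word or n-gram overlap).
    """
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    # Precision: share of selected sentences that are in the reference.
    precision = overlap / len(selected) if selected else 0.0
    # Recall: share of reference sentences that were selected.
    recall = overlap / len(reference) if reference else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the system picks sentences 0, 2, 5 and the
# reference extract contains sentences 0, 2, 3, 7.
p, r, f = precision_recall_f1(selected=[0, 2, 5], reference=[0, 2, 3, 7])
```

Here two of the three selected sentences appear in the reference, so precision is 2/3, recall is 2/4, and the F-measure is their harmonic mean, 4/7.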

Highlights

The amount of data available on the web, on any topic, is growing exponentially
This study presents a generic extractive multi-document summarization model that extracts a summary from multiple Malayalam documents
The PageRank algorithm was run on the graph with a modified initial score for each vertex
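The ranking and selection steps mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a TextRank-style weighted PageRank, whitespace tokenization, bag-of-words cosine similarity as edge weight, a fixed iteration count, and a greedy similarity threshold (`max_sim`) as the diversity criterion. None of these details, nor the function names, come from the source, and the English example sentences stand in for Malayalam input.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sentences(sentences, initial_scores, damping=0.85, iterations=20):
    """Weighted PageRank over a sentence graph, seeded with initial_scores.

    Assumption: a fixed number of iterations. If damped PageRank is run to
    full convergence, the result no longer depends on the starting scores.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weight = similarity between two distinct sentences.
    w = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight per vertex
    scores = list(initial_scores)
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return scores, w


def select_diverse(scores, w, k, max_sim=0.5):
    """Greedily take top-scoring sentences, skipping near-duplicates."""
    chosen = []
    for i in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True):
        if all(w[i][j] < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)  # restore document order


sentences = [
    "the river flooded the town on monday",
    "on monday the river flooded the town",
    "rescue teams reached the town by boat",
    "schools will stay closed this week",
]
scores, w = rank_sentences(sentences, initial_scores=[1.0] * len(sentences))
summary_ids = select_diverse(scores, w, k=2)
```

In this toy input the first two sentences contain the same words, so the selection step keeps only one of them and fills the second slot with a different sentence. How the initial vertex scores are actually set is not described in this excerpt; the uniform `1.0` values are a placeholder.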

Summary

Introduction

The volume of data circulating in the digital space, most of it unstructured text, calls for automated text summarization tools that extract insights quickly. Document summaries give users a briefing on the most notable information contained in a document. Automatic document summarization is one of the most challenging and exciting problems in Natural Language Processing (NLP). Automatic text summarization systems have attracted substantial interest because they provide relevant information in less time [1]. Text summarization is the process of generating a condensed version of the original document. Depending on whether the whole text or a particular part of it is considered, summarization is categorized as generic or query-relevant. A generic summary presents an overall sense of the document’s content, while a query-focused summary shows the document’s content related to the user query [2,3].
