Survey on Extractive Text Summarization Methods with Multi-Document Datasets

P N Varalakshmi K, Jagadish S Kallimani

doi:10.1109/icacci.2018.8554768

Abstract

Text summarization has long been one of the key research areas in Natural Language Processing (NLP). Methods for summarizing one or more documents can be broadly classified into extractive and abstractive text summarization. The former selects key parts of the documents and places them in the summary while balancing salience against redundancy; the latter creates new sentences to summarize the documents. Extractive summarization can further be performed in a supervised manner, with human involvement, or in an unsupervised manner, without any human intervention. This paper surveys a few of the current methods for extractive text summarization where the input is a multi-document set. Multi-document summarization can consider two types of document sets: a homogeneous set, in which the documents share a common topic or theme, and a heterogeneous set, in which the main topics of the documents are unrelated but the documents contain some form of information that is related to the summary. The first method uses sentence regression, performing sentence ranking together with sentence relations, followed by a greedy selection process. The second is an unsupervised paragraph embedding method that uses density peaks clustering. The third method proposes document-level reconstruction using a neural document model. The fourth method is a query-focused, joint neural network based model with an attention mechanism. The fifth method concentrates on coherence, providing a graph-based model that does not require discourse analysis as a prerequisite. We also describe a way to create a heterogeneous multi-document corpus and discuss the limitations of each of these methods.
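As a rough illustration of what balancing salience against redundancy can look like during greedy sentence selection, the following minimal Python sketch may help. It is not the implementation of any of the surveyed methods: the salience score (how widely a sentence's words are shared across the input), the Jaccard-overlap redundancy penalty, and the names `greedy_summary`, `trade_off`, and `k` are all assumptions made for this example, in the spirit of maximal-marginal-relevance style selection.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_summary(sentences, k=2, trade_off=0.7):
    """Greedily pick up to k sentences, trading salience against redundancy.

    Salience (an assumption of this sketch): the average number of input
    sentences containing each distinct word of the sentence, scaled to [0, 1].
    Redundancy: the highest Jaccard overlap with an already selected sentence.
    """
    token_sets = [set(tokenize(s)) for s in sentences]
    # Because each sentence contributes a set, this counts sentences per word.
    freq = Counter(w for toks in token_sets for w in toks)
    top = max(freq.values()) if freq else 1
    salience = [
        sum(freq[w] for w in toks) / (len(toks) * top) if toks else 0.0
        for toks in token_sets
    ]

    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def score(i):
            redundancy = max(
                (jaccard(token_sets[i], token_sets[j]) for j in selected),
                default=0.0,
            )
            return trade_off * salience[i] - (1 - trade_off) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(selected)]


docs = [
    "The storm hit the coast on Monday.",
    "The storm hit the coast on Monday.",
    "Rescue teams arrived on Tuesday.",
]
summary = greedy_summary(docs, k=2)
```

With the default `trade_off`, the duplicate sentence is skipped in favor of the less salient but non-redundant third sentence; setting `trade_off=1.0` disables the redundancy penalty, so the duplicate is selected instead.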

Full Text