A multi-document summarization system based on statistics and linguistic treatment

Rafael Ferreira, Luciano De Souza Cabral, Frederico Freitas, Rafael Dueire Lins, Gabriel De França Silva, Steven J Simske, Luciano Favaro

doi:10.1016/j.eswa.2014.03.023

Abstract

The volume of data available on the Internet today has grown so large that it is no longer humanly feasible to sift useful information from it efficiently. Text summarization techniques offer one solution to this problem. Text summarization, the process of automatically creating a shorter version of one or more text documents, is an important way of finding relevant information in large text libraries or on the Internet. This paper presents a multi-document summarization system that concisely extracts the main aspects of a set of documents while trying to avoid the typical problems of this type of summarization: information redundancy and diversity. This is achieved through a new sentence clustering algorithm based on a graph model that makes use of statistical similarities and linguistic treatment. The proposed system was assessed on the DUC 2002 dataset, where it surpassed DUC competitors by a 50% margin in F-measure in the best case.

Full Text