Evaluating Extractive Automatic Text Summarization Techniques in Spanish

Camilo Caparrós-Laiz, José Antonio García-Díaz, Rafael Valencia-García

doi:10.1007/978-3-030-88262-4_6

Abstract

Due to the large amount of data published on the Internet, tasks related to the automatic generation of summaries from unstructured sources have gained enormous popularity in recent years. Applications include media monitoring, newsletter generation, legal document analysis, virtual assistants that summarize email overload, e-learning, and patent research, among others. One popular approach to generating summaries is extractive summarization, which extracts the most meaningful keywords in a document and presents them to the reader comprehensively. To the best of our knowledge, there is a lack of studies evaluating extractive text summarization techniques in Spanish, especially novel techniques based on state-of-the-art transformers. Consequently, we benchmark traditional and recent approaches to text summarization on the Corpus-TER dataset, which consists of 240 Mexican-Spanish news articles. Our preliminary results suggest that word embeddings from Word2Vec achieve the best results on the ROUGE-1, BLEU, and edit distance metrics.
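As an illustration of the first metric named above, ROUGE-1 can be read as unigram overlap between a generated summary and a reference summary. The following is a minimal sketch, not the evaluation code used in this study: the function name `rouge_1`, the lowercasing, and the whitespace tokenization are assumptions made here for brevity, and a real evaluation would typically use an established implementation with proper Spanish tokenization.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 (hypothetical helper, simplified tokenization)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("el gato duerme en casa", "el gato negro duerme")
    print(scores)
```

In this toy pair, three unigrams ("el", "gato", "duerme") are shared, so precision is 3/5 and recall is 3/4. Recall rewards covering the reference, while precision penalizes extracting extra material, which is why the two are often reported together.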

Full Text