Abstract
Most of the models proposed in the literature for abstractive summarization are suitable for the English language but not for other languages. Multilingual models were introduced to address that language constraint, but despite their applicability being broader than that of the monolingual models, their performance is typically lower, especially for minority languages like Catalan. In this paper, we present a monolingual model for abstractive summarization of textual content in the Catalan language. The model is a Transformer encoder-decoder that is pretrained and fine-tuned specifically for the Catalan language using a corpus of newspaper articles. In the pretraining phase, we introduced several self-supervised tasks to specialize the model in the summarization task and to increase the abstractivity of the generated summaries. To study the performance of our proposal in languages with more resources than Catalan, we replicate the model and the experimentation for the Spanish language. The usual evaluation metrics, not only the widely used ROUGE measure but also more semantic ones such as BERTScore, do not make it possible to correctly evaluate the abstractivity of the generated summaries. In this work, we also present a new metric, content reordering, to evaluate one of the most common characteristics of abstractive summaries: the rearrangement of the original content. We carried out extensive experiments to compare the performance of the monolingual models proposed in this work with two of the most widely used multilingual models in text summarization, mBART and mT5. The experimental results support the quality of our monolingual models, especially considering that the multilingual models were pretrained with many more resources than those used in our models. Likewise, the results show that the pretraining tasks helped to increase the degree of abstractivity of the generated summaries. To our knowledge, this is the first work that explores a monolingual approach for abstractive summarization both in Catalan and Spanish.
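The paper defines the content reordering metric precisely; purely as an intuition, the sketch below shows one plausible way to quantify how much a summary rearranges the source content. It matches each summary sentence to its most lexically similar source sentence and normalizes the number of order inversions among the matched source positions. Both the matching criterion and the normalization are our assumptions for illustration, not the paper's definition.

```python
from itertools import combinations

def content_reordering(source_sents, summary_sents):
    """Illustrative approximation of a content-reordering score:
    0.0 means the summary follows the source order, 1.0 means the
    matched source positions appear fully reversed. NOTE: the paper's
    exact formulation may differ; this sketches the idea only."""
    def overlap(a, b):
        # Jaccard overlap between the word sets of two sentences
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (len(wa | wb) or 1)

    # For each summary sentence, the source position of its best match
    positions = [max(range(len(source_sents)),
                     key=lambda i: overlap(source_sents[i], s))
                 for s in summary_sents]

    pairs = list(combinations(range(len(positions)), 2))
    if not pairs:
        return 0.0  # fewer than two sentences: no order to measure
    inversions = sum(positions[i] > positions[j] for i, j in pairs)
    return inversions / len(pairs)
```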
Highlights
The purpose of the summarization process is to condense the most relevant information from a document or a set of documents into a small number of sentences
Multilingual models such as mBART [9] or mT5 [10] were studied in the literature to address that language constraint, but despite their applicability being broader than that of the monolingual models, their performance is typically lower, especially for languages that are underrepresented in the pretraining corpora or that differ substantially in linguistic terms from the best-represented languages [11,12,13,14]
Monolingual pretraining in languages other than English is still unexplored for language generation tasks such as abstractive summarization. This is the first work that explores a monolingual approach for abstractive summarization both in Catalan and Spanish
Summary
The purpose of the summarization process is to condense the most relevant information from a document or a set of documents into a small number of sentences. While extractive summarization consists of identifying and copying the sentences of the original document that contain the most remarkable and useful information, abstractive summarization requires abstractive operations that must be mastered. In this way, summaries are not mere clippings of the original documents; rather, abstractive summaries are created by choosing the most important phrases of the documents and paraphrasing that content: combining phrases, introducing new words, searching for synonyms, generalizing or specializing some words, or reordering content. A monolingual abstractive text summarization model, News Abstract Summarization for Catalan (NASCA), is proposed. This model, based on the BART architecture [6], is pretrained with several self-supervised tasks to improve the abstractivity of the generated summaries. The monolingual models proposed in this work, NASCA (https://huggingface.co/ELiRF/NASCA, accessed on 19 October 2021) and NASES (https://huggingface.co/ELiRF/NASES, accessed on 19 October 2021), were publicly released through the HuggingFace model hub [16]
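Since both checkpoints are publicly available on the HuggingFace model hub, they can be loaded with the standard transformers sequence-to-sequence API. The following is a minimal sketch, assuming a recent transformers release; the generation settings (input truncation at 512 tokens, beam size 4, summary length limit) are illustrative choices of ours, not parameters reported by the authors.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# NASCA is BART-based, so the generic seq2seq classes apply
tokenizer = AutoTokenizer.from_pretrained("ELiRF/NASCA")
model = AutoModelForSeq2SeqLM.from_pretrained("ELiRF/NASCA")

article = "Text de l'article en català..."  # replace with a real Catalan news article
inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")

# Beam search and length limit are assumed values for illustration
summary_ids = model.generate(**inputs, num_beams=4, max_length=150)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

For Spanish input, the same snippet applies with "ELiRF/NASES" in place of "ELiRF/NASCA".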