RoSummary: Control Tokens for Romanian News Summarization

Mihai Alexandru Niculescu, Mihai Dascalu, Stefan Ruseti

doi:10.3390/a15120472

Open Access

https://doi.org/10.3390/a15120472

Journal: Algorithms	Publication Date: Dec 11, 2022
Citations: 3	License type: CC BY 4.0

Affiliation: Polytechnic University of Bucharest

Abstract

Recent developments in neural architectures have brought significant progress in text generation; nevertheless, the task remains challenging, especially for low-resource languages. This study develops a model for abstractive summarization in Romanian. A corresponding summarization dataset is introduced, followed by multiple models based on the Romanian GPT-2, on top of which control tokens specify characteristics of the generated text, namely counts of sentences and words, token ratio, and n-gram overlap. These are special tokens placed in the prompt received by the model to indicate traits of the text to be generated. The initial model without any control tokens was assessed using BERTScore (F1 = 73.43%) and ROUGE (ROUGE-L accuracy = 34.67%). Control tokens improved the overall BERTScore to 75.42% using <LexOverlap>; when combinations of tokens were used, the model was influenced more by the second token specified in the prompt. Six raters performed human evaluations of 45 summaries generated with different models and decoding methods. The generated texts were all grammatically correct and consistent in most cases, and the evaluations were promising in terms of main idea coverage, details, and cohesion. Paraphrasing still requires improvement, as the models mostly repeat information from the reference text. In addition, we showcase an exploratory analysis of summaries generated using one or two specific control tokens.
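To make the idea of control tokens concrete, the following is a minimal sketch of how the four characteristics named in the abstract could be computed from an article and its summary and then prepended to a prompt. Only `<LexOverlap>` is named in the abstract; the other token names (`NoSentences`, `NoWords`, `RatioTokens`), the whitespace tokenization, the naive sentence splitter, the bigram choice for overlap, and the `Text: ... Summary:` prompt layout are all illustrative assumptions, not necessarily the authors' exact setup.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    # Set of all contiguous n-grams in a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def control_values(article, summary, n=2):
    # Whitespace tokenization is an assumption; a real system would likely
    # count tokens with the model's own tokenizer.
    art_tokens = article.lower().split()
    sum_tokens = summary.lower().split()
    sum_ngrams = ngrams(sum_tokens, n)
    # Share of summary n-grams that also occur in the article.
    overlap = (
        len(sum_ngrams & ngrams(art_tokens, n)) / len(sum_ngrams)
        if sum_ngrams else 0.0
    )
    ratio = len(sum_tokens) / len(art_tokens) if art_tokens else 0.0
    return {
        "NoSentences": len(split_sentences(summary)),  # hypothetical name
        "NoWords": len(sum_tokens),                    # hypothetical name
        "RatioTokens": round(ratio, 2),                # hypothetical name
        "LexOverlap": round(overlap, 2),               # named in the abstract
    }


def build_prompt(article, controls):
    # Hypothetical prompt layout: control tokens first, then the article.
    prefix = " ".join(f"<{name}> {value}" for name, value in controls.items())
    return f"{prefix} Text: {article} Summary:"
```

During training, the values would be derived from the reference summary as above; at inference time, a user could instead pass the desired values directly, for example `build_prompt(article, {"NoSentences": 2, "LexOverlap": 0.5})`. Because the abstract reports that the model is influenced more by the second token in the prompt, the order of the entries in `controls` may matter in practice.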

Full Text