Abstract

Multilingual Multi-Document Summarization aims at ranking the sentences of a cluster with (at least) two news texts (one in the user's language and one in a foreign language) and selecting the top-ranked sentences to build a summary in the user's language. We explored three concept-based statistical measures and one superficial strategy for sentence ranking. We used a bilingual corpus (Brazilian Portuguese-English) encoded in UNL (Universal Network Language), with source and summary sentences aligned based on content overlap. Our experiment shows that "concept frequency normalized by the number of concepts in the sentence" is the measure that best ranks the sentences selected by humans. However, it does not outperform the superficial strategy based on the position of the sentences in the texts. This indicates that the most frequent concepts are not always contained in the first sentences, which humans usually select to build the summaries because they convey the main information of the collection.

Keywords: content selection; concept; statistical measure; multilingual corpus; multi-document summarization.

Highlights

  • Even though a large number of news agencies make information available on the web, it is very difficult to know what is happening in the world unless an event is tragic enough to catch the attention of the international media

  • Given the promising results of Tosta (2014) and Di-Felippo et al. (2016), we have explored the potential of three concept-based measures to capture human content selection strategies in Multilingual Multi-Document Summarization (MMDS): (i) concept frequency (CF), (ii) CF*IDF, and (iii) CF normalized by the number of concepts in the sentence

  • Based on the review of the literature, we have selected three lexical-conceptual measures that are potentially adequate to capture human content selection strategies in MMDS: (i) concept frequency, (ii) concept frequency corrected by the inverse document frequency, and (iii) concept frequency normalized by the number of concepts in the sentence (a sketch of these measures follows this list)
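The exact formulations of these measures are given in the full text; purely as an illustration, the sketch below computes them under the assumption that they mirror the classic TF/TF-IDF family, with concepts (UNL Universal Words) playing the role of terms. All function and variable names here are hypothetical, not the authors' implementation.

```python
import math
from collections import Counter

def concept_scores(sentence_concepts, cluster_sentences, cluster_documents):
    """Score one sentence by three concept-based measures (illustrative only).

    sentence_concepts: list of concepts (e.g., UNL Universal Words) in the sentence.
    cluster_sentences: list of concept lists, one per sentence in the cluster.
    cluster_documents: list of concept sets, one per document in the cluster.
    """
    # Frequency of each concept across the whole cluster.
    cluster_freq = Counter(c for sent in cluster_sentences for c in sent)
    n_docs = len(cluster_documents)

    # (i) CF: sum of the cluster frequencies of the concepts in the sentence.
    cf = sum(cluster_freq[c] for c in sentence_concepts)

    # (ii) CF*IDF: each concept's frequency weighted by its inverse document frequency.
    cf_idf = sum(
        cluster_freq[c] * math.log(n_docs / (1 + sum(c in d for d in cluster_documents)))
        for c in sentence_concepts
    )

    # (iii) CF normalized by the number of concepts in the sentence.
    cf_norm = cf / len(sentence_concepts) if sentence_concepts else 0.0

    return cf, cf_idf, cf_norm
```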


Summary

Introduction

Even though a large number of news agencies make information available on the web, it is very difficult to know what is happening in the world unless an event is tragic enough to catch the attention of the international media. Natural Language Processing (NLP) applications that address the goal of treating multiple languages in different multi-document summarization tasks are relevant tools to deal with the huge and overloaded amount of information in multiple languages. One of these applications is cross-language summarization, which is the production of a summary in a language Lx when the cluster (i.e., a set of news texts on the same topic) is in a language Ly different from Lx (SARKAR, 2014). The approaches of the first category do not use much semantic or language-specific information. They make only minimal assumptions about the language (e.g., that the text can be split into sentences and sentences further into words) and perform well on different languages without linguistic knowledge. The experiment shows that measure (iii) produces the rank with the highest number of aligned sentences, having the best performance in capturing the human preferences, but it did not outperform the sentence-position baseline (a sketch of this ranking step is given below).
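As a rough illustration of how a concept-based measure and the sentence-position baseline produce competing ranks, the sketch below orders the sentences of a cluster either by a given score function or by their position in the source texts and then selects the top-ranked ones. The data format and function names are assumptions for illustration only.

```python
def rank_by_score(sentences, score_fn):
    """Rank sentences (highest score first) by a concept-based measure.

    sentences: list of dicts like {"text": ..., "concepts": [...], "position": ...}
    score_fn: function mapping one sentence dict to a numeric score.
    """
    return sorted(sentences, key=score_fn, reverse=True)

def rank_by_position(sentences):
    """Position baseline: earlier sentences in each text are ranked first."""
    return sorted(sentences, key=lambda s: s["position"])

def build_summary(ranked_sentences, max_sentences=3):
    """Select the top-ranked sentences up to the length limit."""
    return [s["text"] for s in ranked_sentences[:max_sentences]]
```

In the evaluation reported here, a rank is judged by how many of its top sentences align with sentences selected by humans for the reference summaries.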

Related works
The UNLization: conceptual annotation
The Alignment of Source Texts and Human Summaries
Lexical-Conceptual Measures
Investigation of the measures for sentence selection in MMDS
Findings
Final Remarks