Abstract

Multilingual Multi-Document Summarization aims at ranking the sentences of a cluster with (at least) two news texts (one in the user's language and one in a foreign language) and selecting the top-ranked sentences to build a summary in the user's language. We explored three concept-based statistical measures and one superficial strategy for sentence ranking. We used a bilingual corpus (Brazilian Portuguese-English) encoded in UNL (Universal Network Language), with source and summary sentences aligned based on content overlap. Our experiment shows that "concept frequency normalized by the number of concepts in the sentence" is the measure that best ranks the sentences selected by humans. However, it does not outperform the superficial strategy based on the position of the sentences in the texts. This indicates that the most frequent concepts are not always contained in the first sentences, which humans usually select to build the summaries because they convey the main information of the collection.

Keywords: content selection; concept; statistical measure; multilingual corpus; multi-document summarization.

Highlights

  • Even though a large number of news agencies make information available on the web, it is very difficult to know what is happening in the world unless an event is tragic enough to catch the attention of the international media

  • Given the promising results of Tosta (2014) and Di-Felippo et al. (2016), we have explored the potential of three concept-based measures to capture human content selection strategies in Multilingual Multi-Document Summarization (MMDS): (i) concept frequency (CF), (ii) CF*IDF, and (iii) CF normalized by the number of concepts in the sentence

  • Based on the review of the literature, we have selected three lexical-conceptual measures that are potentially adequate to capture human content selection strategies in MMDS: (i) concept frequency, (ii) concept frequency corrected by the inverse document frequency, and (iii) concept frequency normalized by the number of concepts in the sentence (a sketch of these measures follows this list)
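The exact formulations of these measures are given in the full text; purely as an illustration, the sketch below computes them under the assumption that they mirror the classic TF/TF-IDF family, with concepts (UNL Universal Words) playing the role of terms. All function and variable names here are hypothetical, not the authors' implementation.

```python
import math
from collections import Counter

def concept_scores(sentence_concepts, cluster_sentences, cluster_documents):
    """Score one sentence by three concept-based measures (illustrative only).

    sentence_concepts: list of concepts (e.g., UNL Universal Words) in the sentence.
    cluster_sentences: list of concept lists, one per sentence in the cluster.
    cluster_documents: list of concept sets, one per document in the cluster.
    """
    # Frequency of each concept across the whole cluster.
    cluster_freq = Counter(c for sent in cluster_sentences for c in sent)
    n_docs = len(cluster_documents)

    # (i) CF: sum of the cluster frequencies of the concepts in the sentence.
    cf = sum(cluster_freq[c] for c in sentence_concepts)

    # (ii) CF*IDF: each concept's frequency weighted by its inverse document frequency.
    cf_idf = sum(
        cluster_freq[c] * math.log(n_docs / (1 + sum(c in d for d in cluster_documents)))
        for c in sentence_concepts
    )

    # (iii) CF normalized by the number of concepts in the sentence.
    cf_norm = cf / len(sentence_concepts) if sentence_concepts else 0.0

    return cf, cf_idf, cf_norm
```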


Summary

Introduction

Even though a large number of news agencies make information available on the web, it is very difficult to know what is happening in the world unless an event is tragic enough to catch the attention of the international media. Natural Language Processing (NLP) applications that address the goal of treating multiple languages in different multi-document summarization tasks are relevant tools to deal with the huge and overloaded amount of information in multiple languages. One of these applications is cross-language summarization, which is the production of a summary in a language Lx when the cluster (i.e., a set of news texts on the same topic) is in a language Ly different from Lx (SARKAR, 2014). The approaches of the first category do not use much semantic or language-specific information. They make only minimal assumptions about the language (e.g., that the text can be split into sentences and sentences further into words) and perform well on different languages without linguistic knowledge. The experiment shows that measure (iii) produces the rank with the highest number of aligned sentences, having the best performance in capturing the human preferences, but it did not outperform the sentence-position baseline (a sketch of this ranking step is given below).
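As a rough illustration of how a concept-based measure and the sentence-position baseline produce competing ranks, the sketch below orders the sentences of a cluster either by a given score function or by their position in the source texts and then selects the top-ranked ones. The data format and function names are assumptions for illustration only.

```python
def rank_by_score(sentences, score_fn):
    """Rank sentences (highest score first) by a concept-based measure.

    sentences: list of dicts like {"text": ..., "concepts": [...], "position": ...}
    score_fn: function mapping one sentence dict to a numeric score.
    """
    return sorted(sentences, key=score_fn, reverse=True)

def rank_by_position(sentences):
    """Position baseline: earlier sentences in each text are ranked first."""
    return sorted(sentences, key=lambda s: s["position"])

def build_summary(ranked_sentences, max_sentences=3):
    """Select the top-ranked sentences up to the length limit."""
    return [s["text"] for s in ranked_sentences[:max_sentences]]
```

In the evaluation reported here, a rank is judged by how many of its top sentences align with sentences selected by humans for the reference summaries.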

Related works
The UNLization: conceptual annotation
The Alignment of Source Texts and Human Summaries
Lexical-Conceptual Measures
Investigation of the measures for sentence selection in MMDS
Findings
Final Remarks