Abstract
Multilingual Multi-Document Summarization aims at ranking the sentences of a cluster of (at least) two news texts (one in the user's language and one in a foreign language) and selecting the top-ranked sentences for a summary in the user's language. We explored three concept-based statistics and one superficial strategy for sentence ranking. We used a bilingual corpus (Brazilian Portuguese-English) encoded in UNL (Universal Network Language), with source and summary sentences aligned based on content overlap. Our experiment shows that "concept frequency normalized by the number of concepts in the sentence" is the measure that best ranks the sentences selected by humans. However, it does not outperform the superficial strategy based on the position of the sentences in the texts. This indicates that the most frequent concepts are not always contained in the first sentences, which are usually selected by humans to build the summaries because they convey the main information of the collection.
Keywords: content selection; concept; statistical measure; multilingual corpus; multi-document summarization.
Highlights
Even though a large number of news agencies make information available on the web, it is very difficult to know what is happening in the world unless an event is tragic enough to catch the attention of the international media
Given the promising results of Tosta (2014) and Di-Felippo et al. (2016), we have explored the potential of three concept-based measures to capture human content selection strategies in Multilingual Multi-Document Summarization (MMDS): (i) CF, (ii) CF*IDF, and (iii) CF/No. of Cs in S
Based on the review of the literature, we have selected three lexical-conceptual measures that are potentially adequate to capture human content selection strategies in MMDS: (i) concept frequency, (ii) concept frequency corrected by the inverse document frequency, and (iii) concept frequency normalized by the number of concepts in the sentence
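To make the three measures concrete, below is a minimal Python sketch (not from the paper) that scores and ranks the sentences of a cluster under each of them. It assumes the sentences have already been mapped to lists of concept identifiers (e.g., UNL Universal Words); all names and the input format are illustrative assumptions.

from collections import Counter
from math import log

def rank_sentences(cluster, measure="cf_norm"):
    # `cluster` is a list of documents; each document is a list of
    # sentences; each sentence is a list of concept identifiers.
    # Cluster-wide concept frequency (CF): how often each concept
    # occurs across all sentences of all texts in the cluster.
    cf = Counter(c for doc in cluster for sent in doc for c in sent)

    # Document frequency of each concept, used only by the CF*IDF variant.
    n_docs = len(cluster)
    df = Counter()
    for doc in cluster:
        for c in set(c for sent in doc for c in sent):
            df[c] += 1

    scored = []
    for d_idx, doc in enumerate(cluster):
        for s_idx, sent in enumerate(doc):
            if not sent:
                continue
            if measure == "cf":        # (i) concept frequency
                score = sum(cf[c] for c in sent)
            elif measure == "cf_idf":  # (ii) CF corrected by inverse document frequency
                score = sum(cf[c] * log(n_docs / df[c]) for c in sent)
            else:                      # (iii) CF / number of concepts in the sentence
                score = sum(cf[c] for c in sent) / len(sent)
            scored.append((score, d_idx, s_idx))

    # Highest-scoring sentences first.
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Illustrative usage with a toy two-text cluster:
cluster = [
    [["earthquake", "chile", "kill"], ["rescue", "team", "arrive"]],
    [["earthquake", "chile", "magnitude"], ["government", "aid"]],
]
print(rank_sentences(cluster, measure="cf_norm"))

Under measure (iii), a short sentence whose few concepts are all frequent in the cluster can outrank a longer sentence with rarer concepts, which is the effect the normalization is meant to capture.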
Summary
Even though a large number of news agencies make information available on the web, it is very difficult to know what is happening in the world unless an event is tragic enough to catch the attention of the international media. Natural Language Processing (NLP) applications that handle multiple languages in different multi-document summarization tasks are relevant tools to deal with the huge, overwhelming amount of information in multiple languages. One of these applications is cross-language summarization, the production of a summary in a language Lx when the cluster (i.e., the set of news texts on the same topic) is in a language Ly different from Lx (SARKAR, 2014). The approaches of the first category do not use much semantic or language-specific information. They make only minimal assumptions about the language (e.g., that the text can be split into sentences and sentences further into words) and perform well on different languages without linguistic knowledge. The experiment shows that measure (iii) produces the rank with the highest number of aligned sentences, thus having the best performance in capturing human preferences. However, it did not outperform the sentence position baseline.
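For comparison, the sentence position baseline mentioned above can be sketched as follows. The paper does not specify the exact ordering scheme here, so the interleaving and tie-breaking below are assumptions for illustration only.

def position_baseline(cluster):
    # Rank sentences by their position in the source texts: earlier
    # sentences rank higher, and documents are interleaved so that the
    # first sentence of every text precedes any second sentence.
    # This particular tie-breaking scheme is an assumption.
    ranked = [(s_idx, d_idx)
              for d_idx, doc in enumerate(cluster)
              for s_idx, _ in enumerate(doc)]
    # Sort by sentence position first, then by document index.
    return sorted(ranked)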