WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

Hiroaki Hayashi, Prashant Budania, Chris Ackerson, Peng Wang, Graham Neubig, Raj Neervannan

doi:10.1162/tacl_a_00362

Abstract

Aspect-based summarization is the task of generating focused summaries based on specific points of interest. Such summaries aid efficient analysis of text, such as quickly understanding reviews or opinions from different angles. However, due to large differences in the type of aspects for different domains (e.g., sentiment, product features), the development of previous models has tended to be domain-specific. In this paper, we propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization that attempts to spur research in the direction of open-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. We propose several straightforward baseline models for this task and conduct experiments on the dataset. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.

Highlights

Aspect-based summarization is a subtask of summarization that aims to provide targeted summaries of a document from different perspectives (Titov and McDonald, 2008; Lu et al., 2009; Wang and Ling, 2016; Yang et al., 2018; Angelidis and Lapata, 2018).
We propose a large-scale, multi-domain, multi-aspect summarization dataset derived from Wikipedia.
Our analysis shows that there are general challenges in summarizing into various aspects, as well as genre-specific challenges such as time-consistent mentions and proper pronoun conversion depending on the writer of the original content.

Summary

Introduction

Aspect-based summarization is a subtask of summarization that aims to provide targeted summaries of a document from different perspectives (Titov and McDonald, 2008; Lu et al., 2009; Wang and Ling, 2016; Yang et al., 2018; Angelidis and Lapata, 2018). Unlike generic summarization, this gives more concise summaries that are separated according to specific points of interest, allowing readers to fulfill focused information needs more quickly. We argue that (1) generating structured summaries is of inherent interest, as these will allow humans consuming the information to browse specific aspects of interest more readily, and (2) the structure will vary across domains, with different domains demonstrating very different characteristics.
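To make the notion of section-structured summaries concrete, the following is a minimal sketch of treating section titles and boundaries as aspect labels. It is an illustration only, not the WikiAsp construction pipeline: the function name `sections_as_aspects`, the wikitext-style `== Title ==` heading format, and the choice to fold subsections into their parent section are all assumptions made for this example.

```python
import re

# Assumed input format: wikitext-style headings, e.g. "== History ==".
# The first group captures the run of "=" (heading level), the second the title.
HEADING = re.compile(r"^(=+)\s*(.+?)\s*=+\s*$")


def sections_as_aspects(wikitext):
    """Group body text under lower-cased level-2 section titles.

    Hypothetical helper: each level-2 title acts as an aspect label and the
    text up to the next level-2 heading acts as that aspect's summary.
    """
    aspects = {}
    current = None  # text before the first heading (the lead) gets no aspect
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            if len(match.group(1)) == 2:
                current = match.group(2).lower()
                aspects.setdefault(current, [])
            # Deeper headings are dropped; their text stays with the parent.
            continue
        if current is not None and line.strip():
            aspects[current].append(line.strip())
    return {title: " ".join(parts) for title, parts in aspects.items()}


article = (
    "Lead paragraph.\n"
    "== History ==\n"
    "Founded in 1900.\n"
    "=== Early years ===\n"
    "It grew quickly.\n"
    "== Reception ==\n"
    "Critics praised it.\n"
)
for aspect, text in sections_as_aspects(article).items():
    print(f"{aspect}: {text}")
```

Because section titles differ from one domain to another, a mapping like this would yield different aspect sets per domain, which is the variation point (2) above refers to.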

Objectives

Methods

Results

Conclusion