Abstract

With an increasing number of new summarization systems proposed in recent years, there is a pressing need for an automatic text evaluation metric that can accurately and reliably rate the performance of summarization systems. However, current automatic evaluation metrics measure only one or a few aspects of summary quality and do not agree consistently with human judgments. In this paper, we show that combining multiple well-chosen evaluation metrics and training predictive models on human-annotated datasets can lead to more reliable evaluation scores than using any individual automatic metric. Our predictive models trained on a human-annotated subset of the CNN/DailyMail corpus demonstrate significant improvements (e.g., approximately 25% along the coherence dimension) over selected individual metrics. Furthermore, a concise meta-evaluation of automatic metrics is provided, along with an analysis of the performance of 12 predictive models. We also investigate the sensitivity of automatic metrics when combined for training these models. We have made the code, the instructions for experiment setup, and the trained models available as a tool for comparing and evaluating text summarization systems at https://github.com/bzhao2718/ReliableSummEvalReg.
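To make the general recipe concrete, the sketch below illustrates the idea of regressing human ratings on multiple automatic metric scores; it is a rough illustration under stated assumptions, not the paper's exact models, metric set, or data. The GradientBoostingRegressor, the three placeholder metric columns, and the synthetic scores are all illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' exact pipeline): fit a
# regression model that maps several automatic metric scores to a human
# rating (e.g. coherence), then check agreement on held-out summaries.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import kendalltau

# X: one row per summary, columns are automatic metric scores
# (e.g. ROUGE-1, ROUGE-L, BERTScore); y: human coherence ratings.
# Synthetic placeholders are used here in place of real annotations.
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.1 * rng.standard_normal(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, y_tr)

# Agreement with human ratings on held-out summaries, measured by
# Kendall's tau between predicted and annotated scores.
tau, _ = kendalltau(model.predict(X_te), y_te)
print(f"Kendall tau vs. human ratings: {tau:.3f}")
```

In this setup, the learned model can be compared against each individual input metric by computing the same correlation for the raw metric scores, which is the kind of comparison the paper reports.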
