Abstract

With an increasing number of new summarization systems proposed in recent years, there is a pressing need for an automatic text evaluation metric that can accurately and reliably rate the performance of summarization systems. However, current automatic evaluation metrics measure only one or a few aspects of summary quality and do not agree consistently with human judgments. In this paper, we show that combining multiple well-chosen evaluation metrics and training predictive models on human-annotated datasets can lead to more reliable evaluation scores than using any individual automatic metric. Our predictive models trained on a human-annotated subset of the CNN/DailyMail corpus demonstrate significant improvements (e.g., approximately 25% along the coherence dimension) over selected individual metrics. Furthermore, a concise meta-evaluation of automatic metrics is provided, along with an analysis of the performance of 12 predictive models. We also investigate the sensitivity of automatic metrics when combined for training these models. We have made the code, the instructions for experiment setup, and the trained models available as a tool for comparing and evaluating text summarization systems at https://github.com/bzhao2718/ReliableSummEvalReg.
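To make the general recipe concrete, the sketch below illustrates the idea of regressing human ratings on multiple automatic metric scores; it is a rough illustration under stated assumptions, not the paper's exact models, metric set, or data. The GradientBoostingRegressor, the three placeholder metric columns, and the synthetic scores are all illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' exact pipeline): fit a
# regression model that maps several automatic metric scores to a human
# rating (e.g. coherence), then check agreement on held-out summaries.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import kendalltau

# X: one row per summary, columns are automatic metric scores
# (e.g. ROUGE-1, ROUGE-L, BERTScore); y: human coherence ratings.
# Synthetic placeholders are used here in place of real annotations.
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.1 * rng.standard_normal(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, y_tr)

# Agreement with human ratings on held-out summaries, measured by
# Kendall's tau between predicted and annotated scores.
tau, _ = kendalltau(model.predict(X_te), y_te)
print(f"Kendall tau vs. human ratings: {tau:.3f}")
```

In this setup, the learned model can be compared against each individual input metric by computing the same correlation for the raw metric scores, which is the kind of comparison the paper reports.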
