Abstract
N-gram-based similarity metrics such as BLEU and NIST have long been criticized as measures of NLG system performance. Nevertheless, these metrics remain popular and are now also being used to evaluate systems that automatically generate questions from documents, knowledge graphs, images, etc. Given the rising interest in such automatic question generation (AQG) systems, it is important to objectively examine whether these metrics are suitable for this task. In particular, it is important to verify whether the metrics used for evaluating AQG systems focus on the answerability of the generated question, i.e., whether they prefer questions that contain all relevant information such as the question type (Wh-word), entities, relations, etc. In this work, we show that current automatic evaluation metrics based on n-gram similarity do not always correlate well with human judgments about the answerability of a question. To alleviate this problem, and as a first step towards better evaluation metrics for AQG, we introduce a scoring function to capture answerability and show that when this scoring function is integrated with existing metrics, they correlate significantly better with human judgments. The scripts and data developed as part of this work are made publicly available.
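As a rough illustration of what "integrating" an answerability score with an existing n-gram metric could look like, the Python sketch below interpolates a toy answerability score (recall of question words and entity-like tokens from the reference) with BLEU. The interpolation weight delta, the token categories, and the helper names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: combining a toy answerability score with BLEU.
# The weighting scheme and token categories are illustrative assumptions,
# not the formulation proposed in the paper.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

QUESTION_WORDS = {"who", "what", "when", "where", "why", "which", "how"}

def answerability(reference_tokens, hypothesis_tokens):
    """Toy answerability: recall of 'relevant' reference tokens
    (question words and capitalized entity-like tokens) in the hypothesis."""
    relevant = [t for t in reference_tokens
                if t.lower() in QUESTION_WORDS or t[0].isupper()]
    if not relevant:
        return 0.0
    hyp = {t.lower() for t in hypothesis_tokens}
    return sum(t.lower() in hyp for t in relevant) / len(relevant)

def q_metric(reference_tokens, hypothesis_tokens, delta=0.7):
    """Interpolate the answerability score with BLEU, mirroring the general
    idea of integrating a scoring function with an existing metric."""
    bleu = sentence_bleu([reference_tokens], hypothesis_tokens,
                         smoothing_function=SmoothingFunction().method1)
    return delta * answerability(reference_tokens, hypothesis_tokens) \
        + (1 - delta) * bleu

ref = "Who established the Peace of Westphalia in 1648 ?".split()
hyp = "What was established in 1648 ?".split()
print(q_metric(ref, hyp))
```

In this toy setup, a hypothesis that drops the reference's question word and entities is penalized even if it shares many n-grams with the reference, which is the behavior the abstract argues plain n-gram metrics fail to enforce.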
Highlights
This work is a first step in that direction: we propose that, apart from n-gram similarity, any metric for Automatic Question Generation (AQG) should take into account the answerability of the generated questions.
Our work is a first step in this direction, and we hope it will lead to more research on designing the right metrics for AQG.
We took noisy generated questions from three different tasks, viz., document Question Answering (QA), knowledge-base QA, and visual QA, and showed that the answerability scores assigned by humans did not correlate well with existing metrics (a sketch of such a correlation analysis follows this list).
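To make the correlation claim concrete, the hypothetical sketch below compares automatic metric scores for a set of generated questions against human answerability ratings using Pearson and Spearman correlation. The score lists are made-up placeholders, not the paper's data.

```python
# Hypothetical sketch of the correlation analysis described above:
# the numbers are placeholders, not results from the paper.
from scipy.stats import pearsonr, spearmanr

human_answerability = [0.9, 0.2, 0.7, 0.4, 1.0]   # human ratings per question
metric_scores       = [0.8, 0.6, 0.3, 0.5, 0.7]   # e.g., BLEU per question

print("Pearson :", pearsonr(human_answerability, metric_scores)[0])
print("Spearman:", spearmanr(human_answerability, metric_scores).correlation)
```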
Summary
The advent of large-scale datasets for document Question Answering (QA) (Rajpurkar et al., 2016; Nguyen et al., 2016; Joshi et al., 2017; Saha et al., 2018a), knowledge-base-driven QA (Bordes et al., 2015; Saha et al., 2018b), and Visual QA (Antol et al., 2015; Johnson et al., 2017) has enabled the development of end-to-end supervised models for these tasks. However, creating newer datasets for specific domains or augmenting existing datasets with more data is a tedious, time-consuming and expensive process. To alleviate this problem and create even more training data, there is growing interest in developing techniques that can automatically generate questions from a given source, say a document (Du et al., 2017; Du and Cardie, 2017), knowledge base (Reddy et al., 2017; Serban et al., 2016), or image (Li et al., 2017). The scripts and data developed as part of this work are available at https://github.com/PrekshaNema25/Answerability-Metric.