Abstract

Generating questions from conversational context is a difficult problem. A widely used approach generates questions with fine-tuned models conditioned on a suitable answer and its context, usually the source passage. In conversational settings, however, the generated questions tend to be of lower quality because they lack contextual grounding, largely due to unresolved co-references to entities mentioned earlier in the dialogue. Furthermore, most evaluation protocols for question generation do not use a strong question-answering system to judge whether the generated questions are answerable. BLEU, the most prevalent metric for comparing machine-generated text against a human gold standard, does not consider whether a question-answering system could actually answer the question; it primarily measures how many substrings of the candidate and the reference match. We evaluated several question generation models built on a generic encoder-decoder architecture using semantic textual similarity over both the generated questions and the generated answers. Although a larger parameter count usually improves performance, our experiments show that this is not always the case, at least when a large amount of context is missing.
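
To make the evaluation idea concrete, the following is a minimal sketch of semantic textual similarity scoring as cosine similarity between sentence embeddings. The sentence-transformers library, the "all-MiniLM-L6-v2" checkpoint, and the sts_score helper are illustrative assumptions, not the paper's actual implementation.

    # Sketch: score a generated question against a reference by the cosine
    # similarity of their sentence embeddings (semantic textual similarity).
    # Model choice and helper name are illustrative, not from the paper.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def sts_score(generated: str, reference: str) -> float:
        """Return cosine similarity between the two sentences' embeddings."""
        embeddings = model.encode([generated, reference], convert_to_tensor=True)
        return util.cos_sim(embeddings[0], embeddings[1]).item()

    # A paraphrased question scores high here even though its n-gram overlap
    # with the reference (and hence its BLEU score) would be low.
    print(sts_score("Where was she born?", "What is her place of birth?"))

Unlike BLEU, this score rewards questions that convey the same meaning as the reference in different words, which is why the paper evaluates both generated questions and generated answers this way.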
