Manually generating question and answer (QA) pairs for assessments is a time-consuming and labor-intensive task for teachers, especially in higher education. Several studies have proposed methods that use pre-trained large language models to generate QA pairs. However, these methods have primarily been evaluated on datasets that are not educational in nature, and the evaluation metrics and strategies they employ differ significantly from those typically used in educational contexts. As a result, the existing literature does not make a compelling case for the efficacy and practicality of these methods in higher education. This study examined several QA pair generation approaches, namely pipeline, joint, and multi-task approaches, with respect to their performance, efficacy, and limitations in higher education. The approaches were assessed on three datasets drawn from distinct courses. The evaluation combined three automatic metrics, teacher assessments, and a real-world educational evaluation to provide a comprehensive analysis; the approaches were compared directly using their average scores on the automatic metrics across the three datasets. The results of the teacher and real-world educational evaluations indicate that the generated assessments helped students improve their understanding of concepts and their overall performance. These findings are significant for improving the efficacy of QA pair generation tools in higher education.