Establishing Cognitive Item Models for Fair and Theory-Grounded Automatic Item Generation: A Large-Scale Assessment Study with Image-Based Math Items

Abstract

Mathematics is a core domain in large-scale assessments (LSA), yet item development remains resource-intensive, limiting scalability and innovation. Automatic Item Generation (AIG) offers a promising solution, but empirical validations remain rare. This study investigates the psychometric functioning and fairness of 48 cognitive item models designed to generate language-reduced, image-based math items for Grades 1, 3, and 5. Treating these models as proto-theories, we generated 612 item instances varying in cognitive demands and contextual features. Using data from Luxembourg’s school monitoring (N = 35,058), we found that item difficulty was mainly driven by predefined cognitive factors, with stronger contextual influences in early grades. We introduce Differential Radical Functioning to evaluate whether AIG-based items permit comparable score interpretations across subgroups. Results reveal meaningful differences by cultural background, regardless of language proficiency. These findings highlight the importance of contextual embedding and demonstrate the potential of cognitive modeling in AIG for scalable, valid, and equitable assessments.
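One way to make the idea of a cognitive item model concrete is the linear logistic test model (LLTM), in which item difficulty decomposes into contributions of the design radicals; adding a group-by-radical interaction gives one possible formalization of Differential Radical Functioning (DRF). The notation below is illustrative only and is not necessarily the authors' exact specification.

```latex
% Illustrative LLTM-style decomposition: the difficulty of item i is a
% weighted sum of its active radicals (q_{ik} = 1 if radical k is built
% into item i), plus a residual.
\[
  \beta_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + \varepsilon_i
\]
% One hedged reading of Differential Radical Functioning: each radical's
% contribution may shift by a group-specific term; DRF is present when
% some \gamma_{kg} differs from zero.
\[
  \beta_{ig} = \sum_{k=1}^{K} q_{ik}\,\bigl(\eta_k + \gamma_{kg}\bigr)
\]
```

Here η_k is radical k's difficulty contribution and γ_kg its shift for subgroup g; items whose radicals all have γ_kg near zero would support comparable score interpretations across subgroups.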

Similar Papers
  • Research Article
  • Cited by 5
  • 10.25282/ted.1376840
Psychometric Analysis of the First Turkish Multiple-Choice Questions Generated Using Automatic Item Generation Method in Medical Education
  • Dec 31, 2023
  • Tıp Eğitimi Dünyası
  • Yavuz Selim Kiyak + 3 more

Aim: Automatic item generation is "a process of using models to generate items using computer technology". The use of automatic item generation typically involves one of three primary methods: syntax-based, semantic-based, and template-based. Non-template automatic item generation approaches leverage natural language processing techniques. A previous study showed the potential of template-based automatic item generation to create high-quality multiple-choice questions for assessing clinical reasoning in Turkish, marking a first in the field. However, its findings were based only on expert opinions, necessitating further research to examine the psychometric qualities of Turkish items. The aim of this study was to reveal the psychometric characteristics of the first Turkish case-based multiple-choice questions generated using automatic item generation in medical education. Methods: This was a psychometric study. Three Turkish case-based multiple-choice questions on essential hypertension, generated using template-based automatic item generation, were included in an exam that 281 fourth-year medical students participated in. The examination was carried out in person in classroom settings under proctor supervision. Item difficulty and item discrimination (point-biserial correlation) were calculated, and non-functioning distractors were identified. Results: All three items had acceptable levels (higher than 0.20) of point-biserial correlation (p<0.001). The item difficulty levels indicated the presence of one easy, one moderate, and one difficult question. Each item had 2-3 non-functioning options among five options. Conclusions: The results indicated that the items successfully discriminate between high and low performers, providing validity evidence on the quality of the questions in evaluating students' comprehension of the subject. Additionally, the findings suggest that it is feasible to create multiple-choice questions with different difficulty levels in Turkish using a single automatic item generation model. This study demonstrated for the first time that automatic generation of case-based multiple-choice questions in Turkish produces acceptable psychometric characteristics in an authentic assessment setting in medical education. The ability to automatically generate effective multiple-choice questions in Turkish holds promise for enhancing the efficiency of written assessment in Turkish medical education.
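For readers unfamiliar with the two statistics reported above, the sketch below computes item difficulty (proportion correct) and point-biserial discrimination against the rest-of-test score. The data are simulated; only the sample size and item count (281 students, 3 items) echo the study.

```python
# Minimal sketch of a classical item analysis: difficulty (p-value) and
# point-biserial discrimination. Simulated data, not the study's.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(281, 3))  # rows: students, cols: items (0/1)
total = responses.sum(axis=1)

for i in range(responses.shape[1]):
    item = responses[:, i]
    difficulty = item.mean()         # share of students answering correctly
    rest = total - item              # rest score avoids part-whole inflation
    r_pb, p = pointbiserialr(item, rest)
    print(f"item {i + 1}: difficulty={difficulty:.2f}, "
          f"point-biserial={r_pb:.2f} (p={p:.3f})")
```

With real response data, a point-biserial above 0.20, the threshold the authors use, indicates acceptable discrimination.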

  • Research Article
  • Cited by 38
  • 10.1080/10401334.2016.1146608
Using Automatic Item Generation to Improve the Quality of MCQ Distractors
  • Feb 5, 2016
  • Teaching and Learning in Medicine
  • Hollis Lai + 5 more

Construct: Automatic item generation (AIG) is an alternative method for producing large numbers of test items that integrates cognitive modeling with computer technology to systematically generate multiple-choice questions (MCQs). The purpose of our study is to describe and validate a method of generating plausible but incorrect distractors. Initial applications of AIG demonstrated its effectiveness in producing test items. However, expert review of the initial items identified a key limitation: the generation of implausible incorrect options, or distractors, might limit the applicability of items in real testing situations. Background: Medical educators require development of test items in large quantities to facilitate the continual assessment of student knowledge. Traditional item development processes are time-consuming and resource intensive. Studies have validated the quality of generated items through content expert review. However, no study has yet documented how generated items perform in a test administration, and none has validated AIG through student responses to generated test items. Approach: To validate our refined AIG method for generating plausible distractors, we collected psychometric evidence from a field test of the generated test items. A three-step process was used to generate test items in the area of jaundice. At least 455 Canadian and international medical graduates responded to each of the 13 generated items embedded in a high-stakes exam administration. Item difficulty, discrimination, and index of discrimination estimates were calculated for the correct option as well as each distractor. Results: Item analysis results for the correct options suggest that the generated items measured candidate performance across a range of ability levels while providing a consistent level of discrimination for each item. Results for the distractors reveal that the generated items differentiated the low- from the high-performing candidates. Conclusions: Previous research on AIG highlighted how this item development method can be used to produce high-quality stems and correct options for MCQ exams. The purpose of the current study was to describe, illustrate, and evaluate a method for modeling plausible but incorrect options. Evidence provided in this study demonstrates that AIG can produce psychometrically sound test items. More important, by adapting the distractors to match the unique features presented in the stem and correct option, MCQ generation using automated procedures has the potential to produce plausible distractors and yield large numbers of high-quality items for medical education.
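The option-level analysis this abstract describes can be sketched as follows: per-option selection rates plus a simple upper-lower discrimination index. The 27% grouping and the 5% rule for flagging non-functioning distractors are common conventions, not necessarily the authors' procedure, and all data are simulated.

```python
# Sketch of a distractor analysis: how often each option is chosen overall,
# and how choice rates differ between high and low scorers. Illustrative
# data and thresholds only.
import numpy as np

rng = np.random.default_rng(1)
n, options, key = 455, "ABCDE", "C"          # 455 candidates, 5 options
choices = rng.choice(list(options), size=n)  # hypothetical responses to one item
total = rng.normal(size=n)                   # hypothetical total exam scores

order = np.argsort(total)
k = int(0.27 * n)
lo, hi = order[:k], order[-k:]               # bottom and top 27% by total score

for opt in options:
    p_all = np.mean(choices == opt)
    d = np.mean(choices[hi] == opt) - np.mean(choices[lo] == opt)
    flag = "  <- possibly non-functioning" if opt != key and p_all < 0.05 else ""
    print(f"option {opt}: chosen by {p_all:.2%}, discrimination {d:+.2f}{flag}")
```

A well-behaved distractor attracts some low scorers (negative discrimination), while the keyed option should discriminate positively.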

  • Research Article
  • Cited by 7
  • 10.1007/s10459-023-10225-y
A suggestive approach for assessing item quality, usability and validity of Automatic Item Generation
  • Apr 25, 2023
  • Advances in health sciences education : theory and practice
  • Filipe Falcão + 5 more

Automatic Item Generation (AIG) refers to the process of using cognitive models to generate test items with computer modules. It is a new but rapidly evolving research area in which cognitive and psychometric theory are combined into a digital framework. However, the item quality, usability and validity of AIG relative to traditional item development methods lack clarification. This paper takes a top-down, strong-theory approach to evaluate AIG in medical education. Two studies were conducted. In Study I, participants with different levels of clinical knowledge and item writing experience developed medical test items both manually and through AIG; both item types were compared in terms of quality and usability (efficiency and learnability). In Study II, automatically generated items were included in a summative exam in the content area of surgery, and a psychometric analysis based on Item Response Theory inspected the validity and quality of the AIG items. Items generated by AIG showed good quality, evidence of validity, and were adequate for testing students' knowledge. The time spent developing the contents for item generation (cognitive models) and the number of items generated did not vary with the participants' item writing experience or clinical knowledge. AIG produces numerous high-quality items in a fast, economical and easy-to-learn process, even for item writers who are inexperienced or lack clinical training. Medical schools may benefit from a substantial improvement in cost-efficiency in developing test items by using AIG. Item writing flaws can be significantly reduced thanks to the application of AIG's models, thus generating test items capable of accurately gauging students' knowledge.

  • Research Article
  • Cited by 36
  • 10.3109/0142159x.2016.1150989
Using cognitive models to develop quality multiple-choice questions
  • Mar 21, 2016
  • Medical Teacher
  • Debra Pugh + 4 more

With the recent interest in competency-based education, educators are being challenged to develop more assessment opportunities. As such, there is increased demand for exam content development, which can be a very labor-intense process. An innovative solution to this challenge has been the use of automatic item generation (AIG) to develop multiple-choice questions (MCQs). In AIG, computer technology is used to generate test items from cognitive models (i.e. representations of the knowledge and skills that are required to solve a problem). The main advantage yielded by AIG is the efficiency in generating items. Although technology for AIG relies on a linear programming approach, the same principles can also be used to improve traditional committee-based processes used in the development of MCQs. Using this approach, content experts deconstruct their clinical reasoning process to develop a cognitive model which, in turn, is used to create MCQs. This approach is appealing because it: (1) is efficient; (2) has been shown to produce items with psychometric properties comparable to those generated using a traditional approach; and (3) can be used to assess higher order skills (i.e. application of knowledge). The purpose of this article is to provide a novel framework for the development of high-quality MCQs using cognitive models.

  • Research Article
  • Cited by 3
  • 10.2196/65726
Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG
  • Apr 4, 2025
  • JMIR Formative Research
  • Yavuz Selim Kıyak + 1 more

Background: Template-based automatic item generation (AIG) is more efficient than traditional item writing, but it still relies heavily on expert effort in model development. While nontemplate-based AIG, leveraging artificial intelligence (AI), offers efficiency, it faces accuracy challenges. Medical education, a field that relies heavily on both formative and summative assessments with multiple-choice questions, is in dire need of AI-based support for the efficient automatic generation of items. Objective: We aimed to propose a hybrid AIG to demonstrate whether it is possible to generate item templates using AI in the field of medical education. Methods: This is a mixed-methods methodological study with proof-of-concept elements. We propose the hybrid AIG method as a structured series of interactions between a human subject matter expert and AI, designed as a collaborative authoring effort. The method leverages AI to generate item models (templates) and cognitive models to combine the advantages of the two AIG approaches. To demonstrate how to create item models using hybrid AIG, we used 2 medical multiple-choice questions: one on respiratory infections in adults and another on acute allergic reactions in the pediatric population. Results: The hybrid AIG method we propose consists of 7 steps. The first 5 steps are performed by an expert in a customized AI environment. These involve providing a parent item, identifying elements for manipulation, selecting options and assigning values to elements, and generating the cognitive model. After a final expert review (Step 6), the content in the template can be used for item generation through traditional (non-AI) software (Step 7). We showed that AI is capable of generating item templates for AIG under the control of a human expert in only 10 minutes. Leveraging AI in template development made it less challenging. Conclusions: The hybrid AIG method transcends the traditional template-based approach by marrying the “art” that comes from AI as a “black box” with the “science” of algorithmic generation under the oversight of an expert as a “marriage registrar”. It not only capitalizes on the strengths of both approaches but also mitigates their weaknesses, offering a human-AI collaboration to increase efficiency in medical education.

  • Research Article
  • 10.3724/sp.j.1041.2010.00802
The Impact on Ability Estimates of Predicted Parameters from Cognitively Designed Items in a Computerized Adaptive Testing Environment
  • Aug 31, 2010
  • Acta Psychologica Sinica
  • Xiang-Dong Yang

Automatic item generation has become a promising area in recent studies. In automatic item generation, items with targeted psychometric properties are generated during testing. The feasibility of automatic item generation lies in the fact that items are generated from a set of observable item stimulus features, which are mapped onto the cognitive variables underlying the item solution and are calibrated through cognitive psychometric models. Parameters of a generated item can then be predicted from the specific combination of the calibrated item stimulus features in the item. Predicted item parameters, compared to those calibrated from empirical data, involve more complex sources of uncertainty. Although the relationship between the sufficiency of the cognitive model of item solving and the adequacy of item parameter prediction can be theoretically justified, the degree to which such predicted parameters impact various aspects of testing is an empirical question that needs to be explored. This paper investigated the impact of predicted item parameters on ability estimates in a computerized adaptive testing environment, based on abstract reasoning test (ART) items which were generated using the cognitive design system approach (Embretson, 1998). The item bank contained 150 items with two sets of item difficulties, of which one was predicted from the item design features and the other was calibrated from sample data. Each of the 263 subjects who participated in the study received two subtests, one based on predicted parameters and the other on calibrated parameters. The item bank was split into two parallel halves based on predicted item parameters to prevent items in the bank from repeated administration within subjects. Subjects were randomly assigned to one of the four testing procedures resulting from the combinations between parameter types (predicted versus calibrated) and item bank halves (first half versus second half). Results of the study showed a clear regression-to-the-mean effect of the predicted item parameters compared with the calibrated item parameters. Inward biases of ability estimates from the subtest using predicted item parameters were observed when ability estimates were compared across different subtests within subjects. Compared with its counterpart using calibrated parameters, standard errors of ability estimates were larger for the subtest using predicted item parameters in the mid-range of the scale, where the regression-to-the-mean effect of the predicted item parameters is minimal, and were smaller in the rest of the scale, possibly due to the joint impact of increased uncertainty of predicted item parameters, estimation biases, and limitations of the item bank at various levels of the ability scale. When ability was estimated from the same subtest using different types of item parameters, a very high correlation (.995) was obtained and no biases were observed throughout almost the entire scale. Standard errors of ability estimates were larger for predicted parameters, yet the differences were small.
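The mechanics under study, adaptive item selection from parameters that are predicted rather than calibrated, can be illustrated with the 2PL model, where the next item is the one with maximum Fisher information at the current ability estimate. Everything below is simulated; only the bank size (150 items) echoes the study.

```python
# Sketch of one CAT step under the 2PL model, with "predicted" difficulties
# modeled as calibrated difficulties plus noise. All numbers are invented.
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

rng = np.random.default_rng(2)
a = rng.uniform(0.8, 2.0, size=150)                     # discriminations
b_calibrated = rng.normal(0.0, 1.0, size=150)           # calibrated difficulties
b_predicted = b_calibrated + rng.normal(0.0, 0.3, 150)  # predicted: extra noise

theta_hat = 0.0                                         # current ability estimate
next_item = int(np.argmax(information(theta_hat, a, b_predicted)))
print(f"administer item {next_item}, "
      f"info={information(theta_hat, a[next_item], b_predicted[next_item]):.2f}")
```

Because selection keys on the noisy predicted difficulties while responses follow the true ones, repeatedly simulating such steps would be one way to probe the regression-to-the-mean and standard-error effects the study reports.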

  • Research Article
  • 10.14686/buefad.1424213
Evaluating the Psychometric Characteristics of Generated Visual Reading Comprehension Items
  • Apr 16, 2024
  • Bartın University Journal of Faculty of Education
  • Ayfer Sayın

Reading comprehension, a crucial skill in today's information-rich environment, extends beyond text to include visual elements. Manual creation of visual reading comprehension items poses challenges, necessitating an innovative approach and prompting the exploration of Automatic Item Generation (AIG) as a solution. This study demonstrates the use of AIG for creating visual reading comprehension items, aiming to provide a time-efficient and reliable alternative for item writers. The AIG process starts with expert input to develop cognitive and item models; computer algorithms then generate the items. A field test with 1,380 8th-grade students evaluated the psychometric properties of the generated items. The results indicate the potential of AIG to efficiently produce a substantial item pool for visual reading comprehension. The generated items exhibit consistent difficulty levels (0.58 to 0.66), ensuring an appropriate challenge for students. High item discrimination (0.48 to 0.69) effectively distinguishes between students with varying visual reading comprehension skills, and item-total correlations (0.40 to 0.57) further support the quality and validity of the generated items. The automated process yields efficient results in terms of item difficulty and discrimination, underscoring the potential of AIG for high-quality assessment of visual reading comprehension.

  • Research Article
  • Cited by 112
  • 10.1111/j.1365-2923.2012.04289.x
Using automatic item generation to create multiple‐choice test items
  • Jul 16, 2012
  • Medical Education
  • Mark J Gierl + 2 more

Many tests of medical knowledge, from the undergraduate level to the level of certification and licensure, contain multiple-choice items. Although these are efficient in measuring examinees' knowledge and skills across diverse content areas, multiple-choice items are time-consuming and expensive to create. Changes in student assessment brought about by new forms of computer-based testing have created the demand for large numbers of multiple-choice items. Our current approaches to item development cannot meet this demand. We present a methodology for developing multiple-choice items based on automatic item generation (AIG) concepts and procedures. We describe a three-stage approach to AIG and we illustrate this approach by generating multiple-choice items for a medical licensure test in the content area of surgery. To generate multiple-choice items, our method requires a three-stage process. Firstly, a cognitive model is created by content specialists. Secondly, item models are developed using the content from the cognitive model. Thirdly, items are generated from the item models using computer software. Using this methodology, we generated 1248 multiple-choice items from one item model. Automatic item generation is a process that involves using models to generate items using computer technology. With our method, content specialists identify and structure the content for the test items, and computer technology systematically combines the content to generate new test items. By combining these outcomes, items can be generated automatically.
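The three-stage logic is easy to picture in code: once content specialists have fixed a cognitive model and an item model, software simply enumerates the combinations of the model's element values. The template and elements below are invented for illustration; the study's actual surgery model produced 1248 items.

```python
# Toy item model: a stem with manipulable elements, expanded by taking the
# Cartesian product of the element value sets. Template text is hypothetical.
from itertools import product

stem = "A {age}-year-old patient presents with {finding}. What is the next step?"
elements = {
    "age": ["25", "45", "65"],
    "finding": ["painless jaundice", "right upper quadrant pain"],
}

items = [stem.format(**dict(zip(elements, combo)))
         for combo in product(*elements.values())]

print(f"{len(items)} items generated from one item model")  # 3 * 2 = 6 here
print(items[0])
```

The item count grows multiplicatively with the number of elements and values per element, which is how a single model can yield items in the thousands.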

  • Book Chapter
  • Cited by 1
  • 10.1007/978-3-030-88178-8_23
ILSA in Arts Education: The Effect of Drama on Competences
  • Jan 1, 2022
  • Rikke Gürgens Gjærum + 2 more

This chapter discusses the role of aesthetic education with a focus on educational drama and theatre. It investigates the lack of international large-scale assessment (ILSA) studies in the field of aesthetic education and exemplifies how to measure competence development in one of the aesthetic subjects, drama, based on the international mixed-method large-scale assessment study DICE (Drama Improves Lisbon Key Competences in Education). The aim is to gain new understanding of the role of aesthetics in schooling, relating traditional philosophical arts theory from Aristotle and Dewey to relevant contemporary conceptualizations, such as twenty-first-century skills (OECD), Lisbon Key Competences (EU), and Education for Sustainable Development (UNESCO). The discussion considers three main questions: Why do only a few international large-scale quantitative assessments of drama education exist? Why are researchers and practitioners in drama education skeptical about quantitative measurements? Can we design large-scale assessment studies in drama education?
Keywords: Educational drama, Applied theatre, Arts education, Aesthetic subjects, DICE

  • Research Article
  • 10.1111/emip.12019
Editorial
  • Sep 1, 2013
  • Educational Measurement: Issues and Practice
  • Derek C Briggs


  • Research Article
  • Cited by 14
  • 10.1002/j.0022-0337.2016.80.3.tb06090.x
Three Modeling Applications to Promote Automatic Item Generation for Examinations in Dentistry
  • Mar 1, 2016
  • Journal of Dental Education
  • Hollis Lai + 4 more

Test items created for dentistry examinations are often individually written by content experts. This approach to item development is expensive because it requires the time and effort of many content experts but yields relatively few items. The aim of this study was to describe and illustrate how items can be generated using a systematic approach. Automatic item generation (AIG) is an alternative method that allows a small number of content experts to produce large numbers of items by integrating their domain expertise with computer technology. This article describes and illustrates how three modeling approaches to item content (item cloning, cognitive modeling, and image-anchored modeling) can be used to generate large numbers of multiple-choice test items for examinations in dentistry. Test items can be generated by combining the expertise of two content specialists with technology supported by AIG. A total of 5,467 new items were created during this study. From substitution of item content, to modeling appropriate responses based upon a cognitive model of correct responses, to generating items linked to specific graphical findings, AIG has the potential for meeting increasing demands for test items. Further, the methods described in this study can be generalized and applied to many other item types. Future research applications for AIG in dental education are discussed.

  • Research Article
  • Cited by 46
  • 10.1080/15305058.2011.635830
The Role of Item Models in Automatic Item Generation
  • Jul 1, 2012
  • International Journal of Testing
  • Mark J Gierl + 1 more

Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates or prototypes, that highlight the features or elements in the assessment task that must be manipulated. Second, these item model elements are manipulated to generate new items with the aid of computer-based algorithms. With this two-step process, hundreds or even thousands of new items can be created from a single item model. The purpose of our article is to describe seven different but related topics that are central to the development and use of item models for automatic item generation. We start by defining item model and highlighting some related concepts; we describe how item models are developed; we present an item model taxonomy; we illustrate how item models can be used for automatic item generation; we outline some benefits of using item models; we introduce the idea of an item model bank; and finally, we demonstrate how statistical procedures can be used to estimate the parameters of the generated items without the need for extensive field or pilot testing.

  • Research Article
  • Cited by 22
  • 10.1007/s10459-022-10092-z
Feasibility assurance: a review of automatic item generation in medical assessment.
  • Mar 1, 2022
  • Advances in Health Sciences Education
  • Filipe Falcão + 2 more

Background: Current demand for multiple-choice questions (MCQs) in medical assessment is greater than the supply. Consequently, an urgency for new item development methods arises. Automatic Item Generation (AIG) promises to overcome this burden, generating calibrated items based on the work of computer algorithms. Despite the promising scenario, there is still no evidence to encourage a general application of AIG in medical assessment. It is therefore important to evaluate AIG regarding its feasibility, validity and item quality. Objective: Provide a narrative review regarding the feasibility, validity and item quality of AIG in medical assessment. Methods: Electronic databases were searched for peer-reviewed, English-language articles published between 2000 and 2021 by means of the terms ‘Automatic Item Generation’, ‘Automated Item Generation’, ‘AIG’, ‘medical assessment’ and ‘medical education’. Reviewers screened 119 records and 13 full texts were checked according to the inclusion criteria. A validity framework was applied to the included studies to draw conclusions regarding the validity of AIG. Results: A total of 10 articles were included in the review. Synthesized data suggest that AIG is a valid and feasible method capable of generating high-quality items. Conclusions: AIG can solve current problems related to item development. It reveals itself as an auspicious next-generation technique for the future of medical assessment, promising numerous quality items both quickly and economically.

  • Research Article
  • 10.4324/9780203803912-16
Learning Sciences, Cognitive Models, and Automatic Item Generation
  • Aug 21, 2012
  • Jacqueline P Leighton


  • Research Article
  • Cited by 8
  • 10.1080/10401334.2022.2119569
Three Sources of Validation Evidence Needed to Evaluate the Quality of Generated Test Items for Medical Licensure
  • Aug 30, 2022
  • Teaching and Learning in Medicine
  • Mark Gierl + 4 more

Issue: Automatic item generation is a method for creating medical test items using an automated, technological solution. It is a contemporary method that can scale the item development process for production of large numbers of new items, support the building of multiple forms, and allow rapid responses to changing medical content guidelines and threats to test security. The purpose of this analysis is to describe three sources of validation evidence that are required when producing high-quality medical licensure test items with the automatic item generation methodology, to ensure evidence for valid test score inferences. Evidence: Generated items are used to make inferences about examinees’ medical knowledge, skills, and competencies. We present three sources of evidence required to evaluate the quality of the generated items and to ensure that they measure the intended knowledge, skills, and competencies. The sources of evidence we present here relate to the item definition, the item development process, and the item quality review. An item is defined as an explicit set of properties that include the parameters, constraints, and instructions used to elicit a response from the examinee. This definition allows for a critique of the input used for automatic item generation. The item development process is evaluated using a validation table, whose purpose is to support verification of the assumptions related to model specification made by the subject-matter expert. This table provides a succinct summary of the content and constraints that were used to create new items. The item quality review is used to evaluate the statistical quality of the generated items, which often focuses on the difficulty and the discrimination of the correct and incorrect options. Implications: Automatic item generation is an increasingly popular item development method. The generated items from this process must be bolstered by evidence to ensure the items measure the intended knowledge, skills, and competencies. The purpose of this analysis is to describe these sources of evidence that can be used to evaluate the quality of the generated items. The important role of medical expertise in the development and evaluation of the generated items is highlighted as a crucial requirement for producing validation evidence.
