Articles published on Item calibration
327 Search results
Sorted by recency
- New
- Research Article
- 10.1177/01466216261420758
- Feb 6, 2026
- Applied Psychological Measurement
- Jonas Bjermo + 2 more
Large-scale achievement tests require item banks with items for use in future tests. Before an item is included in the bank, its characteristics need to be estimated; this estimation process is called item calibration. For the quality of future achievement tests, it is important to perform this calibration well, and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suitable ability. Theoretical evidence shows advantages of ability-dependent allocation of pretest items, but it is not clear whether these theoretical results also hold in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see improved calibration precision. While this average improvement is moderate, we are able to identify for which kinds of items the method works well, which enables targeting specific item types for optimal calibration. We also discuss possibilities for improving the method.
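To make the allocation idea concrete: under a two-parameter logistic model (a common choice in this literature; the abstract does not state the model used for the SweSAT pretest items), the Fisher information that a response from an examinee with ability $\theta$ carries about the difficulty of item $j$ is

$$
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}, \qquad
I_j(\theta) = a_j^{2}\, P_j(\theta)\,\bigl(1 - P_j(\theta)\bigr),
$$

which peaks where $\theta$ is close to $b_j$. Ability-dependent allocation exploits this by routing each pretest item to examinees whose abilities lie near the region where the item's information (or, for the full parameter vector, an optimality criterion on its information matrix) is largest.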
- New
- Research Article
- 10.70478/psicothema.2026.38.01
- Feb 1, 2026
- Psicothema
- Javier Suárez-Álvarez + 3 more
Artificial Intelligence (AI) is increasingly used to enhance traditional assessment practices by improving efficiency, reducing costs, and enabling greater scalability. However, its use has largely been confined to large corporations, with limited uptake by researchers and practitioners. This study aims to critically review current AI-based applications in test construction and propose practical guidelines to help maximize their benefits while addressing potential risks. A comprehensive literature review was conducted to examine recent advances in AI-based test construction, focusing on item development and calibration, with real-world examples to demonstrate practical implementation. Best practices for AI in test development are evolving, but responsible use requires ongoing human oversight. Effective AI-based item generation depends on quality training data, alignment with intended use, model comparison, and output validation. For calibration, essential steps include defining construct validity, applying prompt engineering, checking semantic alignment, conducting pseudo factor analysis, and evaluating model fit with exploratory methods. We propose a practical guide for using generative AI in test development and calibration, targeting challenges related to validity, reliability, and fairness by linking each issue to specific guidelines that promote responsible, effective implementation.
- Research Article
- 10.58578/tsaqofah.v6i1.8384
- Dec 17, 2025
- TSAQOFAH
- Ahmad Rosyid Ridho + 1 more
Attitude measurement has been widely examined in social and educational studies; however, the challenges of transforming subjective viewpoints into objective numerical data and determining appropriate measurement tools still require in-depth methodological scrutiny. This study aims to examine various types of attitude scales and underscore the urgency of validity and reliability testing in the design of research instruments. The research employed a qualitative method based on library research, collecting data from secondary sources in the form of books and scholarly journals through systematic literature searches, which were subsequently analyzed using content analysis techniques. The findings identify four main categories of scales for measuring affective aspects, namely the Likert scale, Guttman scale, semantic differential, and Thurstone scale, each of which has distinct characteristics in producing nominal, ordinal, and interval-level data and in detecting the intensity of respondents’ attitudes. These findings enrich the research methodology literature by clarifying the function and position of each scale in the selection of instruments aligned with research purposes and design. It is concluded that the accuracy of attitude measurement is strongly influenced by the appropriateness of scale selection in relation to the characteristics of the construct and data, as well as the rigor of validity and reliability testing; therefore, researchers are recommended to select scales according to the desired level of precision and to consider the use of advanced analytical models such as the Rasch model for item calibration purposes.
- Research Article
- 10.3390/jcm14248774
- Dec 11, 2025
- Journal of Clinical Medicine
- Hadeel R Bakhsh + 9 more
Background/Objectives: The cancer experience has a significant affective impact on patients, often causing anxiety and depression. There is therefore a clear need for psychometrically valid and culturally appropriate tools for assessing anxiety and depression in this population, including among Arabic-speaking patients. This study evaluates the measurement properties of the PROMIS Depression in Cancer (PROMIS-Ca-D) and Anxiety in Cancer (PROMIS-Ca-A) questionnaires, part of the Patient-Reported Outcomes Measurement Information System® (PROMIS®), for assessing depression and anxiety in Saudi Arabian cancer patients. Methods: The PROMIS-Ca-D was translated into Arabic and subsequently tested with 30 participants from five Arabic-speaking countries. The PROMIS-Ca-A had been previously translated into Arabic. The second phase recruited 213 cancer patients in Riyadh, Saudi Arabia, who completed the PROMIS-Ca-D and PROMIS-Ca-A. Rasch analysis (rating scale model) was used to assess category functioning, item fit, unidimensionality, differential item functioning, and reliability of the measures. Results: The translation process confirmed the cultural appropriateness of the Arabic PROMIS-Ca-D. In the validation cohort (N = 213), Rasch analysis indicated excellent reliability for both scales. Although disordered modal thresholds and signs of multidimensionality were observed, control analyses confirmed that these features did not compromise the item calibrations or the person measures. Item fit was adequate, and differential item functioning was negligible. However, suboptimal item-person targeting was noted. Conclusions: The Arabic PROMIS-Ca-D and PROMIS-Ca-A are psychometrically sound instruments for evaluating psychological distress in Arabic-speaking cancer patients. Future research should focus on assessing responsiveness and evaluating metric equivalence with legacy measures.
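For reference, the rating scale model used here expresses the probability that person $n$ endorses category $k$ of item $i$ in terms of a person measure $\theta_n$, an item calibration $\delta_i$, and thresholds $\tau_j$ shared across items (generic notation, not taken from the paper):

$$
P(X_{ni}=k) = \frac{\exp\sum_{j=1}^{k}\bigl(\theta_n - \delta_i - \tau_j\bigr)}{1 + \sum_{m=1}^{M}\exp\sum_{j=1}^{m}\bigl(\theta_n - \delta_i - \tau_j\bigr)}, \qquad k = 1,\dots,M,
$$

with $P(X_{ni}=0)$ given by the reciprocal of the denominator. Well-functioning response categories correspond to an expected ordering of the thresholds; the disordered thresholds reported above refer to departures from that ordering.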
- Research Article
- 10.1080/0142159x.2025.2586619
- Nov 20, 2025
- Medical Teacher
- Shicong Feng + 4 more
Introduction Item difficulty prediction is crucial for planning and administering educational assessments, especially high-stakes ones such as medical licensing examinations. The inconsistent findings across existing studies, however, highlight a critical gap in understanding which modeling components are most influential. This research addresses this gap by systematically investigating several key factors hypothesized to affect prediction performance. Methods This study explored the impact of: (1) model domain specificity, (2) input content granularity (e.g. item stem, correct answer, and distractors), (3) embedding dimensionality, and (4) the choice of the machine learning regressor. We selected a range of embedding models and a series of machine learning regressors to predict the difficulty of 2,815 multiple-choice questions sourced from the National Center for Health Professions Education Development. Results Analyses revealed that XGBoost outperformed the other regressors (mean RMSE = 0.1779), and the use of a domain-specific MedEmbed-small embedding model consistently improved prediction accuracy (mean RMSE = 0.1860). Notably, using the item stem and the correct answer as input features achieved the best trade-off between predictive accuracy and model parsimony (RMSE = 0.1756). Discussion These findings offer valuable insights for data-driven measurement practices including Automated Item Calibration, Computerized Adaptive Testing, and Intelligent Tutoring Systems in medical education. Furthermore, this study revealed that the optimal feature set for difficulty prediction is contingent on the item style. Future research should extend this line of inquiry to the difficulty prediction of multimodal test items. Practice points: The choice of machine learning algorithm, particularly XGBoost, is the most critical factor for accurate item difficulty prediction. Domain-specific embeddings greatly improve predictive performance over general-purpose models in medical education contexts. Using the item stem and correct answer as input features provides an optimal balance between prediction accuracy and model parsimony. A computationally efficient, high-performing model is achievable without relying on larger, more complex deep learning architectures. These findings provide a practical, low-cost framework for developing AI-assisted assessment tools like automated test assembly.
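A minimal sketch of the kind of pipeline described above — embed the item text, then regress difficulty on the embedding with XGBoost. The embedding model name, data handling, and hyperparameters are illustrative placeholders, not details taken from the study.

```python
# Sketch: predict item difficulty from text embeddings (placeholder model/params).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def fit_difficulty_model(stems, answers, difficulties,
                         model_name="all-MiniLM-L6-v2"):  # placeholder encoder
    """Embed stem + correct answer, then fit an XGBoost regressor on difficulty."""
    encoder = SentenceTransformer(model_name)
    texts = [f"{s} {a}" for s, a in zip(stems, answers)]   # stem + key as input
    X = encoder.encode(texts)                              # (n_items, embed_dim)
    y = np.asarray(difficulties, dtype=float)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    reg = XGBRegressor(n_estimators=500, max_depth=4, learning_rate=0.05)
    reg.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, reg.predict(X_te)) ** 0.5
    return reg, rmse
```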
- Research Article
- 10.1007/s12571-025-01611-y
- Nov 10, 2025
- Food Security
- Ka Yu Kwan + 5 more
Abstract Food security is vital for global health and sustainable development. However, existing validated versions of the Household Food Security Survey Module (HFSSM) do not meet the need for a shorter Chinese version with a shorter reference period. This study aimed to examine the psychometric properties of a 6-item Chinese version of the HFSSM (3-month reference period) among the economically deprived population in Hong Kong. This study included 512 observations from community-dwelling Chinese adults (≥ 18 years) who were social services users or food assistance service applicants and had household income < 75% of the population median. Construct validity and scale targeting were assessed using Rasch analysis. Convergent validity, known-group validity, internal consistency, and test-retest reliability were also examined. The construct validity was satisfactory, with a logical item calibration hierarchy and good infit statistics for five items (0.81–1.26), although a floor effect was indicated in the Wright map. The convergent validity was supported through a moderate positive correlation with the single-item household food insufficiency scale (r = 0.58). Known-group validity showed higher scores among those aged < 65 years (d = 1.17, p < 0.001), living with children aged < 18 years (d = 1.09, p < 0.001), enrolled in the food assistance program (d = 0.54, p < 0.001), living at the poverty line or below (d = 0.30, p = 0.010), and with poor or fair health (d = 0.20, p = 0.028). Furthermore, the scale had fair-to-good internal consistency (Cronbach's alpha = 0.88, Rasch person reliability = 0.64) and excellent one-week test-retest reliability (ICC = 0.96). The 6-item Chinese version of the HFSSM (3-month reference period) is a valid and reliable measure of household food insecurity in the economically deprived population in Hong Kong.
- Research Article
- 10.1111/jedm.70016
- Oct 26, 2025
- Journal of Educational Measurement
- Xi Wang + 1 more
Abstract This study builds on prior research on adaptive testing by examining the performance of item calibration methods in the context of multidimensional multistage tests with within‐item multidimensionality. Building on the adaptive module‐level approach, where test‐takers proceed through customized modules based on their initial performance, this research investigates how different calibration methods perform under certain conditions. Specifically, the study evaluates three calibration methods—concurrent calibration, fixed item parameter calibration, and concurrent calibration with multiple panels—within a multidimensional multistage test framework. Using computer simulations, the study assesses ability and item parameter recovery across various conditions, including sample size, correlations among dimensions, and routing stage length. Across 36 simulation conditions, each replicated 10 times, results show that although calibration methods exert minimal influence on item and ability parameter estimates, the correlation among dimensions plays a significant role in both item and ability estimation. Additionally, sample size and routing stage length notably impact the estimation of item discrimination parameters. This study lays the foundation for further research and practical advancements in multidimensional multistage testing, offering a starting point for refining and innovating testing practices.
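"Within-item multidimensionality" means a single item loads on more than one ability dimension at once. In a compensatory multidimensional 2PL parameterisation (a standard formulation; the abstract does not give the exact model), the response probability is

$$
P(X_{ij}=1 \mid \boldsymbol{\theta}_i) = \frac{1}{1 + \exp\!\bigl(-(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j)\bigr)},
$$

and an item is within-item multidimensional when more than one entry of its discrimination vector $\mathbf{a}_j$ is nonzero.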
- Research Article
- 10.1111/jedm.70012
- Oct 8, 2025
- Journal of Educational Measurement
- Jing Huang + 5 more
Abstract Online calibration estimates new item parameters alongside previously calibrated items, supporting efficient item replenishment. However, most existing online calibration procedures for Cognitive Diagnostic Computerized Adaptive Testing (CD‐CAT) lack mechanisms to ensure content balance during live testing. This limitation can lead to uneven content coverage, potentially undermining the alignment with instructional goals. This research extends the current calibration framework by integrating a two‐phase test design with a content‐balancing item selection method into the online calibration procedure. Simulation studies evaluated item parameter recovery and attribute profile estimation accuracy under the proposed procedure. Results indicated that the developed procedure yielded more accurate new item parameter estimates. The procedure also maintained content representativeness under both balanced and unbalanced constraints. Attribute profile estimation was sensitive to item parameter values. Accuracy declined when items had larger parameter values. Calibration improved with larger sample sizes and smaller parameter values. Longer test lengths contributed more to profile estimation than to new item calibration. These findings highlight design trade‐offs in adaptive item replenishment and suggest new directions for hybrid calibration methods.
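An illustrative sketch (not the authors' algorithm) of content-balanced selection during online calibration: among candidate pretest items, pick the most informative one whose content area still has remaining quota.

```python
# Sketch: content-balanced item selection for online calibration (illustrative).
from dataclasses import dataclass

@dataclass
class Item:
    item_id: int
    content_area: str
    info: float  # expected information (or exposure-adjusted criterion) for this examinee

def select_content_balanced(candidates, quota_remaining):
    """Return the most informative candidate whose content area still has quota left."""
    eligible = [it for it in candidates if quota_remaining.get(it.content_area, 0) > 0]
    if not eligible:            # all quotas exhausted: fall back to the full pool
        eligible = list(candidates)
    best = max(eligible, key=lambda it: it.info)
    quota_remaining[best.content_area] = quota_remaining.get(best.content_area, 1) - 1
    return best
```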
- Research Article
- 10.3390/jcm14196992
- Oct 2, 2025
- Journal of clinical medicine
- Martino Belvederi Murri + 12 more
Background: Demoralization, anxiety, irritability, and depression are common among hospital patients and are associated with poorer outcomes and greater healthcare burden. Early identification is essential, but simultaneous screening across multiple domains is often impractical with questionnaires. Computerized Adaptive Testing (CAT) offers a solution by tailoring item administration, reducing test length while preserving measurement precision. The aim of this study was to develop and validate FRIDA (Four-factor Rapid Interactive Diagnostic Assessment), a freely accessible, web-based CAT for rapid multidimensional screening of psychopathology in hospital patients. Methods: We analysed data from 472 medically ill in-patients at a University Hospital. Item calibration was performed using a four-factor graded response model (demoralization, anxiety, irritability, depression) in the mirt package. CAT simulations were run with 1000 virtual respondents to optimize item selection, exposure control, and stopping rules. The best configuration was applied to the real dataset. Criterion validity for demoralization was evaluated against the Diagnostic Criteria for Psychosomatic Research (DCPR). Results: The four-factor model showed good fit (CFI = 0.947, RMSEA = 0.080). Factor correlations were moderate to high, with the strongest overlap between demoralization and depression (r = 0.93). In simulations, the CAT required, on average, 7.8 items and recovered trait estimates with high accuracy (r = 0.94-0.97). In real patients, mean test length was 11 items, with accuracy of r = 0.95 across domains. FRIDA demonstrated good criterion validity for demoralization (AUC = 0.816; sensitivity 80%, specificity 67.5%). Conclusions: FRIDA is the first freely available, multidimensional CAT for rapid screening of psychopathology in hospital patients. It offers a scalable, efficient, and precise tool for integrating mental health assessment into routine hospital care.
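A generic sketch of the adaptive-testing loop the abstract describes — administer the most informative remaining item, re-estimate the trait, and stop once the standard error is small enough or the item budget is reached. The callables and thresholds are placeholders, not FRIDA's implementation.

```python
# Sketch of a CAT loop with maximum-information selection and an SE stopping rule.
def run_cat(items, respond, estimate, item_info, se_threshold=0.3, max_items=20):
    """items: candidate pool; respond(item) -> observed answer;
    estimate(items, answers) -> (theta, se); item_info(item, theta) -> Fisher info."""
    administered, answers = [], []
    theta, se = 0.0, float("inf")
    remaining = list(items)
    while remaining and len(administered) < max_items and se > se_threshold:
        nxt = max(remaining, key=lambda it: item_info(it, theta))  # most informative item
        remaining.remove(nxt)
        administered.append(nxt)
        answers.append(respond(nxt))
        theta, se = estimate(administered, answers)                # update trait estimate
    return theta, se, administered
```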
- Research Article
- 10.55981/salls.2025.13083
- Sep 1, 2025
- Southeast Asian Language and Literature Studies
- Bayu Permana Sukma + 6 more
Reading is a fundamental tool for gleaning information and engaging in many fields of literacy. However, poor reading outcomes among Indonesian students on the PISA test indicate a low literacy level. The reading ability referred to in this study is the ability to understand reading texts on various topics and in various forms. This study used a sample of 234 students from eight schools in North Kalimantan Province. The data were obtained from reading tests carried out by students via computers. The test consists of 40 questions in multiple-choice, short-answer, and essay formats. The data were processed in several stages, including data reduction and item calibration using the Rasch Model. Apart from the reading tests, data were also obtained through questionnaires to gather information about literacy activities carried out by students at school and at home. The results showed that the average reading literacy ability of students in North Kalimantan Province was at a low level, with a score of 349, while the highest average reading literacy score in Indonesia was 489. The results also showed that there was a relationship between students' reading literacy skills and their conditions and habits at home. As a recommendation, students need to become more accustomed to reading multimodal texts through multimedia devices. It is important to make them capable of understanding texts containing not only words but also pictures, numbers, graphs, and tables, as demanded by modern literacy. Moreover, parents also need to be encouraged to help improve students' literacy capacity.
- Research Article
- 10.3390/jintelligence13080102
- Aug 12, 2025
- Journal of Intelligence
- Markus Sommer + 1 more
This article provides a critical review of conceptually different approaches to automatic and transformer-based automatic item generation. Based on a discussion of the current challenges that have arisen due to changes in the use of psychometric tests in recent decades, we outline the requirements that these approaches should ideally fulfill. Subsequently, each approach is examined individually to determine the extent to which it can contribute to meeting the challenges. In doing so, we will focus on the cost savings during the actual item construction phase, the extent to which they may contribute to enhancing test validity, and potential cost savings in the item calibration phase due to either a reduction in the sample size required for item calibration or a reduction in the item loss due to insufficient psychometric characteristics. In addition, the article also aims to outline common recurring themes across these conceptually different approaches and outline areas within each approach that warrant further scientific research.
- Research Article
- 10.70478/psicothema.2025.37.24
- Aug 1, 2025
- Psicothema
- Pere J Ferrando + 3 more
Likert-type scales, first introduced by Rensis Likert in 1932, have become one of the most widely used assessment tools across a range of scientific and professional domains, owing to their simplicity and effectiveness. The purpose of the present study is to critically review their use and to propose a set of practical guidelines aimed at optimizing their construction, analysis, and application. A systematic literature review of guidelines focused on the development, analysis, scoring, use, and interpretation of Likert scales was carried out. Several key areas for improvement in the construction and use of Likert-type scales were identified, including the operational definition of constructs, item formulation, selection of the number of response categories, response analysis, collection of validity evidence, item calibration, and score interpretation. Based on the findings, a practical guide comprising fifteen recommendations is proposed: ten focused on the appropriate design, construction, and analysis of Likert scales, and five aimed at guiding appropriate use of pre-existing scales by researchers and practitioners.
- Research Article
- 10.52589/bjeldp-4skvbgua
- Jul 30, 2025
- British Journal of Education, Learning and Development Psychology
- Afemikhe, O A + 1 more
This study assessed the psychometric properties of the Mathematics Achievement test for Secondary School Students in Edo State, Nigeria, using the four-parameter logistic model (4PLM) of Item Response Theory (IRT). The study adopted a descriptive survey design. The population comprised students from 312 public junior secondary schools in Edo State, while the sample consisted of 2,204 students selected from this population. The research instrument was a 40-item multiple-choice Mathematics Achievement Test developed by Afemikhe and Imasuen (2024). The instrument, previously validated and standardized, had a reliability coefficient of 0.89 using the Kuder-Richardson Formula 20 (KR-20). Unidimensionality of the data was verified through Principal Component Analysis using SPSS, while item calibration was conducted with the jMetrik IRT software to estimate item difficulty, discrimination, guessing, and carelessness parameters. The results revealed that most items demonstrated very high discrimination, indicating a strong capacity to differentiate between students with high and low levels of achievement in mathematics. Most items were difficult, suggesting that the test provided sufficient challenge for students. However, a high proportion of items displayed elevated guessing parameters, reflecting issues with distractor quality. On the positive side, carelessness was generally low, suggesting that students responded attentively. Based on the findings, it was recommended that the distractors of the test items be reviewed and improved to reduce guessing, and that IRT frameworks be more widely adopted in the evaluation of educational assessments.
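For reference, the four-parameter logistic model estimates, for each item $j$, a discrimination $a_j$, a difficulty $b_j$, a lower asymptote $c_j$ (guessing), and an upper asymptote $d_j$ (with $1-d_j$ reflecting carelessness):

$$
P(X_{ij}=1 \mid \theta_i) = c_j + (d_j - c_j)\,\frac{1}{1 + e^{-a_j(\theta_i - b_j)}}.
$$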
- Research Article
- 10.19090/pp.v18i2.2559
- Jul 14, 2025
- Primenjena psihologija
- Diana Mutiah + 2 more
This study aimed to evaluate the psychometric properties of the 10-item Indonesian version of the Brief Self-Control Scale (BSCS). It used the polytomous Rasch model, which enables more detailed analysis, including differential item functioning (DIF) analysis. The participants in this study were 1001 Indonesian high school students. We found that the partial credit model (PCM) was a better fit than the rating scale model. Furthermore, the unidimensionality, local independence, and monotonicity assumptions of the PCM were valid for the BSCS. Q5 was the only item that did not fit the PCM. The step parameters of the BSCS functioned well, with values ranging from low to high, as expected, for all items, indicating monotonicity. Person separation reliability was 0.71, indicating that the BSCS has good internal consistency. The DIF analysis showed that item Q5 functioned differently across genders. In general, the remaining nine items of the BSCS have good psychometric properties for measuring self-control.
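For reference, the partial credit model gives the probability that person $n$ scores in category $k$ of item $i$ using item-specific step parameters $\delta_{ij}$ (generic notation, not the paper's):

$$
P(X_{ni}=k) = \frac{\exp\sum_{j=1}^{k}\bigl(\theta_n - \delta_{ij}\bigr)}{1 + \sum_{m=1}^{M_i}\exp\sum_{j=1}^{m}\bigl(\theta_n - \delta_{ij}\bigr)}, \qquad k = 1,\dots,M_i,
$$

with $P(X_{ni}=0)$ equal to the reciprocal of the denominator. The rating scale model is the special case $\delta_{ij} = \delta_i + \tau_j$ with thresholds $\tau_j$ shared across items, which is why the two models can be compared for fit as described above.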
- Research Article
- 10.1044/2025_ajslp-24-00524
- Jul 10, 2025
- American journal of speech-language pathology
- Lillian Durán + 7 more
Conceptual scoring is a useful approach to bilingual vocabulary tests that can identify language delays or impairments by considering bilingual children's lexical-semantic knowledge in both languages. The purpose of this study was to develop and calibrate a conceptually scored expressive vocabulary measure, the Multitudes Expressive Vocabulary (EVO) task, for use in screening Spanish-English bilingual children. Item design of the English and Spanish items was informed by prior literature and bilingual corpus data, and item review was conducted to ensure linguistic appropriateness and to minimize racial or cultural bias in English and Spanish versions. To begin item calibration in each language, English and Spanish items were administered to the same 1,219 bilingual children enrolled in kindergarten and first grade. Item-level difficulties were calculated using Rasch modeling in each language and then were correlated across languages. Correlations met minimum thresholds, which justified joint calibration on a unitary scale, and there was evidence of unidimensionality. The conceptually scored version had appropriate item fit statistics across the range of ability. Finally, moderately positive correlations with an existing measure of bilingual expressive vocabulary provided evidence of criterion validity. The development process of the Multitudes conceptually scored expressive vocabulary screening measure is described. A final set of empirically derived items had appropriate fit statistics and had evidence of construct validity when conceptually scored. Multitudes EVO represents an innovation in universal screening by allowing students to respond in English or Spanish, which improves accuracy and efficiency.
- Research Article
- 10.24042/tadris.v10i1.26547
- Jun 29, 2025
- Tadris: Jurnal Keguruan dan Ilmu Tarbiyah
- Fenny Thresia + 3 more
This study aims to develop and validate the Multiculturally Responsive Teaching (MRT) Instrument to assess the multicultural competence of pre-service English teachers. Multicultural competence is essential in fostering inclusive learning environments, yet there is a lack of standardized instruments tailored to measure this construct effectively within teacher education. The study employs a Research and Development (R&D) approach using the ADDIE model (Analysis, Design, Development, Implementation, Evaluation). The instrument was designed based on key dimensions of multicultural competence: cognitive, affective, and behavioral aspects. Data were collected from 335 English Education students across Indonesia, and analyzed using the Rasch Model to assess the instrument's validity and reliability. Findings indicate that the MRT instrument demonstrates high internal consistency, with a Cronbach’s Alpha of 0.92, a person reliability score of 10.14, and an item reliability score of 5.14, suggesting strong measurement precision. The Person-Item Map (Wright Map) confirms the instrument's ability to differentiate respondents' competence levels, while the rating scale analysis supports the functionality of the Likert scale categories. The classification of item difficulty based on Logit Value Index (LVI) reveals a well-structured hierarchy, ensuring appropriate item calibration. The study concludes that the MRT instrument is a valid and reliable tool for measuring multicultural competence among pre-service English teachers. Its application can support curriculum enhancement, professional development, and policy formulation in teacher education programs. Future research is encouraged to expand its implementation in different educational contexts to further validate its effectiveness.
- Research Article
- 10.1111/bmsp.12395
- Jun 6, 2025
- The British Journal of Mathematical and Statistical Psychology
- Maria Bolsinova + 2 more
The Elo Rating System, which originates from competitive chess, has been widely utilised in large-scale online educational applications, where it is used for on-the-fly estimation of ability, item calibration, and adaptivity. In this paper, we aim to critically analyse the shortcomings of the Elo rating system in an educational context, shedding light on its measurement properties and when these may fall short in accurately capturing student abilities and item difficulties. In a simulation study, we look at the asymptotic properties of the Elo rating system. Our results show that the Elo ratings are generally not unbiased and that their variances are context-dependent. Furthermore, in scenarios where items are selected adaptively based on the current ratings and the item difficulties are updated alongside the student abilities, the variance of the ratings across items and students artificially increases over time, and as a result the ratings do not converge. We propose a solution to this problem that uses two parallel chains of ratings, which removes the dependence of item selection on the current errors in the ratings.
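A minimal sketch of the kind of Elo update commonly used in educational systems (generic form, not the authors' exact specification): after each response, the student rating and the item rating move in opposite directions in proportion to the prediction error.

```python
# Sketch: Elo-style updates for a student ability rating and an item difficulty rating.
import math

def expected_correct(theta, beta):
    """Predicted probability of a correct response under a logistic link."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def elo_update(theta, beta, correct, k=0.4):
    """Return updated (student_rating, item_rating) after observing one response."""
    error = (1.0 if correct else 0.0) - expected_correct(theta, beta)
    return theta + k * error, beta - k * error
```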
- Research Article
- 10.1192/bjo.2025.10453
- Jun 1, 2025
- BJPsych Open
- Sureyya Melike Toparlak + 4 more
Aims: The main aim of this quality improvement project was to ensure that each outpatient CAMHS clinic room is equipped with the necessary, functional equipment for safe and comprehensive baseline and ongoing monitoring of patients prescribed antidepressant, antipsychotic, and ADHD medications, in line with NICE guidelines, the Maudsley Prescribing Guidelines, and hospital guidelines. Methods: The sites included in this project were Raglan house (single point of access), The Clock house (Community Service South Oxfordshire), and Slade and Maple house (neurodevelopmental conditions outpatient service in Oxford). Data collection was conducted with the help of a checklist used for each clinic room in outpatient CAMHS. All four sites were included in the data interpretation process, each with 54 items on the checklist. The checklist includes quantitative and qualitative data that are crucial to ensure standards and to meet the requirements of the guidelines mentioned above. The items fall into three groups: physical health monitoring; infection prevention; and privacy, confidentiality and comfort. Examples of items were window blinds and engaged/vacant signs for privacy and confidentiality; sanitiser and soap for infection prevention; and a stethoscope and height measurement tool for physical health. Results: On average, 44% of checklist items were present across the sites, meaning 56% of items were not available. Of the present items, 96% were working well, whereas 4% were dysfunctional, such as a clock with no battery, an unstable scale, a faulty thermometer, and a limited amount of sanitiser. Moreover, concerns were raised about a shortage of rooms for routine and urgent appointments across multiple sites, despite online and telephone appointments being offered. In addition, some of the rooms did not have appropriate lighting. Issues that posed an immediate risk to patient safety were prioritised and reported to the estates team. The remainder were due to be reported at the time this abstract was written. Conclusion: Functional clinical equipment is essential to ensure patient safety. Efficient and active use of channels for reporting missing or dysfunctional items, as well as regular maintenance and calibration of clinical equipment, are key to excellent and safe care. All staff members are responsible for making sure that appropriate equipment is available.
- Research Article
- 10.1111/bmsp.12387
- Mar 10, 2025
- The British Journal of Mathematical and Statistical Psychology
- Jonas Bjermo + 2 more
Before items can be implemented in a test, the item characteristics need to be calibrated through pretesting. To achieve high‐quality tests, it's crucial to maximize the precision of estimates obtained during item calibration. Higher precision can be attained if calibration items are allocated to examinees based on their individual abilities. Methods from optimal experimental design can be used to derive an optimal ability‐matched calibration design. However, such an optimal design assumes known abilities of the examinees. In practice, the abilities are unknown and estimated based on a limited number of operational items. We develop the theory for handling the uncertainty in abilities in a proper way and show how the optimal calibration design can be derived when taking account of this uncertainty. We demonstrate that the derived designs are more robust when the uncertainty in abilities is acknowledged. Additionally, the method has been implemented in the R‐package optical.
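One way to express the idea (illustrative notation, not necessarily the authors' exact criterion): instead of evaluating the item information $I_j$ at the estimated ability $\hat\theta$ as if it were the true ability, the design criterion uses the information averaged over the uncertainty about the true ability given its estimate,

$$
\bar{I}_j(\hat{\theta}) = \int I_j(\theta)\, f\bigl(\theta \mid \hat{\theta}\bigr)\,\mathrm{d}\theta,
$$

where $f(\theta \mid \hat{\theta})$ reflects the limited precision of the ability estimate obtained from the operational items; the optimal calibration design is then derived with $\bar{I}_j$ in place of $I_j$.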
- Research Article
- 10.1111/bjet.13570
- Feb 24, 2025
- British Journal of Educational Technology
- Yunting Liu + 2 more
Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find the LLMs used in this study to have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Utilizing item response theory, the item parameters calibrated from LLM respondents have high correlations (e.g., >0.8 for GPT-3.5) with their human-calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human). Practitioner notes. What is already known about this topic: The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool. Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items. Data augmentation using AI has been an effective strategy for enhancing machine learning model performance. What this paper adds: This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans. The study finds that LLM respondents yield item parameters similar to those produced by 50 undergraduate respondents. Using LLM respondents to augment human response data yields mixed results. Implications for practice and/or policy: The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments. This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.
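The agreement statistic reported above can be reproduced mechanically: a Spearman rank correlation between item difficulties calibrated from human responses and from LLM-generated responses. The arrays below are hypothetical placeholders, not study data.

```python
# Sketch: compare human- vs. LLM-calibrated item difficulties (hypothetical values).
import numpy as np
from scipy.stats import spearmanr

human_difficulty = np.array([-1.2, -0.4, 0.1, 0.6, 1.3])   # hypothetical calibrations
llm_difficulty   = np.array([-1.0, -0.5, 0.3, 0.4, 1.5])   # hypothetical calibrations

rho, pval = spearmanr(human_difficulty, llm_difficulty)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```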