Articles published on Item bank
1767 Search results
- Research Article
- 10.1177/01466216261420758
- Feb 6, 2026
- Applied Psychological Measurement
- Jonas Bjermo + 2 more
Large-scale achievement tests require the existence of item banks with items for use in future tests. Before an item is included in the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of the future achievement tests, it is important to perform this calibration well and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suitable ability. Theoretical evidence shows advantages of using ability-dependent allocation of pretest items. However, it is not clear whether these theoretical results also hold in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify for what kind of items the method works well. This enables targeting specific item types for optimal calibration. We also discuss possibilities for improvements of the method.
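To make the idea of ability-dependent allocation concrete, the following minimal sketch compares how much Fisher information two groups of examinees provide about one item's difficulty. It assumes a Rasch (1PL) model, which the abstract does not name; the function names, the item difficulty, and both ability lists are hypothetical values chosen for illustration, not figures from the SweSAT study.

```python
import math


def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under an assumed Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def difficulty_information(thetas, b):
    """Fisher information about difficulty b from examinees with abilities thetas.

    Under the Rasch model each examinee contributes p * (1 - p), which peaks
    when the examinee's ability equals the item difficulty.
    """
    return sum(rasch_prob(t, b) * (1.0 - rasch_prob(t, b)) for t in thetas)


# Hypothetical pretest item and two hypothetical groups of five examinees.
b = 1.0
spread_allocation = [-2.0, -1.0, 0.0, 1.0, 2.0]     # abilities spread widely
matched_allocation = [0.5, 0.75, 1.0, 1.25, 1.5]    # abilities near the item

info_spread = difficulty_information(spread_allocation, b)
info_matched = difficulty_information(matched_allocation, b)
print(f"spread: {info_spread:.3f}, matched: {info_matched:.3f}")
```

With the same number of examinees, the matched group yields more information about the difficulty parameter. Optimal designs for models with more item parameters are more involved than this one-parameter case.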
- Research Article
- 10.3389/fpubh.2026.1760871
- Feb 4, 2026
- Frontiers in Public Health
- Xinxin Wang + 5 more
Background: Gestational diabetes mellitus (GDM) is increasingly prevalent worldwide and is associated with substantial short- and long-term risks for mothers and offspring, making high-quality, accessible health information essential. At the same time, artificial intelligence (AI) chatbots based on large language models are being widely used for health queries, yet their accuracy, reliability and readability in the context of GDM remain unclear. Methods: We first evaluated six AI chatbots (ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro and Claude Sonnet 4.5) using 200 single-best-answer multiple-choice questions (MCQs) on GDM drawn from MedQA, MedMCQA and the Chinese National Medical Examination item bank, covering four domains: epidemiology and risk factors, clinical manifestations and diagnosis, maternal and neonatal outcomes, and management and treatment. Each item was posed three times to every model under a standardized prompting protocol, and accuracy was defined as the proportion of correctly answered questions. For public-facing information, we identified 15 core GDM education questions using Google Trends and expert review, and queried four chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5 and Gemini 2.5 Pro). Two obstetricians independently assessed reliability using DISCERN, EQIP, GQS and JAMA benchmarks, and readability was quantified using ARI, CL, FKGL, FRES, GFI and SMOG indices. Results: Overall MCQ accuracy differed significantly across the six chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy (92.17%) and DeepSeek-V3.2 and Gemini 2.5 Pro performing comparably well, while ChatGPT-4o, DeepSeek-R1 and Claude Sonnet 4.5 scored lower. Newer model generations (ChatGPT-5 vs. ChatGPT-4o; DeepSeek-V3.2 vs. DeepSeek-R1) consistently outperformed their predecessors across all four domains. Among the four models evaluated on public-education questions, ChatGPT-5 achieved the highest reliability scores (DISCERN 42.53 ± 7.20; EQIP 71.67 ± 6.17), whereas Claude Sonnet 4.5, DeepSeek-V3.2 and Gemini 2.5 Pro scored lower. JAMA scores were uniformly low (0–0.07/4), reflecting poor transparency. All models produced text above the recommended sixth-grade reading level; ChatGPT-5 showed the most favorable readability profile (for example, FKGL 7.43 ± 2.42, FRES 62.47 ± 13.51) but still did not meet guideline targets. Conclusion: Contemporary AI chatbots can generate generally accurate and moderately reliable GDM-related information, with newer model generations showing clear gains in diagnostic validity. However, limited transparency and systematically high reading levels indicate that these tools are not yet suitable as stand-alone resources for GDM patient education and should be used as adjuncts to clinician counseling and professionally curated materials.
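Two of the readability indices named above, FRES and FKGL, are simple functions of word, sentence and syllable counts. The sketch below applies the standard published formulas; the function names and the example counts are hypothetical, and it assumes the counts are already available, since syllable counting differs between tools and the abstract does not say which tool the authors used.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score (FRES); higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL); approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single chatbot answer.
words, sentences, syllables = 100, 5, 150
fres = flesch_reading_ease(words, sentences, syllables)
fkgl = flesch_kincaid_grade(words, sentences, syllables)
print(f"FRES = {fres:.2f}, FKGL = {fkgl:.2f}")
```

Both indices depend only on average sentence length and average syllables per word, so shorter sentences and simpler vocabulary are the levers for moving a text toward a sixth-grade target.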
- Research Article
- 10.1111/1460-6984.70170
- Feb 1, 2026
- International journal of language & communication disorders
- Eline Alons + 5 more
Communicative participation is the most important outcome of speech and language therapy. A patient-reported outcome measure (PROM) for children would help capture this outcome. Before this PROM can be developed, it is important to find out what situations children themselves find difficult because of their communication problem. The aim of the study was to identify relevant aspects of self-reported communicative participation in children with communication disorders. Thirteen children (5-12 years old) with speech disorders, developmental language disorders (DLDs), voice disorders and/or hearing loss were interviewed with semi-structured interviews. Before the interview they kept a diary for 1 week, documenting participation situations that were difficult because of their communication problem. Within 1 week after completing the diaries, the children were interviewed. In addition, children's ability to recall situations and reflect upon communicative participation was observed. The data analysis was conducted using directed content analysis, drawing on an existing theoretical framework. A total of 171 situations were discussed, leading to the identification of 44 concepts, categorized into the following six categories: person, topic, pace, location, moment and mode. Some of the participants had difficulty recalling situations and reflecting upon communicative participation. This was particularly true for children under 8 years of age (all with DLD) and two children over 8 years of age with DLD and an indication for a school for children with special needs. The 44 concepts provide insight into the difficulties in communicative participation experienced by children themselves. These concepts will be used to develop a PROM to assess children's communicative participation. What is already known on this subject: Communicative participation is the key outcome of speech and language therapy. However, there is a lack of measurement instruments (preferably patient-reported outcome measures, PROMs) to assess communicative participation of children. Additionally, children's own perspectives on their communicative participation, which could inform the development of such an instrument, have not yet been explored. What this paper adds to existing knowledge: This study focuses on communicative participation situations as described by children with speech, language and communication needs (SLCN). Based on children's own experiences, 44 concepts describing communicative participation were identified. What are the potential or actual clinical implications of this work? This study enhances a comprehensive understanding of communicative participation from the perspective of children. The identified concepts can already be used in conversations with children about their communicative participation. Additionally, the findings will contribute to the development of an item bank for measuring communicative participation in children with speech, language and communication needs.
- Research Article
- 10.1007/s00405-025-09995-5
- Feb 1, 2026
- European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery
- Berit Schneider-Stickler + 5 more
Injection of Botulinum neurotoxin (BoNT) is regarded as standard treatment for spasmodic dysphonia (SD), reducing the overactivity of the affected muscles. Due to the lack of standardized outcome parameters for diagnosing SD or assessing its treatment over time, the evaluation of systematic clinical evidence on the effects of BoNT therapy on SD symptom control is difficult. The registry presented in this article aimed to evaluate outcomes after BoNT treatment in SD patients in Austria and Germany, based on selected subjective and objective voice parameters. Forty-one patients with SD were included in this multicentric registry; after drop-out of 2 patients, the results of 39/41 (95.1%) patients could be analyzed per protocol. Demographic and treatment characteristics as well as the occurrence of therapy-related side effects were recorded. Perceptual voice sound evaluation (RBH scale), voice range profile measurements (VRP), number of spasms, and severity of voice strain were assessed. Patients were asked to complete the Voice Handicap Index (VHI-9i) and the Communicative Participation Item Bank (CPIB), and to indicate their perceived phonation effort on a VAS scale. The parameters were compared between baseline and 1-month post-BoNT treatment. All patients received BoNT type A (Xeomin®/Merz, Allergan®/Allergan or Dysport®/Ipsen). SD symptoms did not affect the frequency or dynamic range of the singing voice, nor did they affect MPT; in contrast, all the other parameters assessed were aberrant at baseline. BoNT treatment improved Jitter-%, Dysphonia Severity Index (DSI), and degrees of roughness and hoarseness, but failed to restore them to normal. BoNT significantly reduced voice spasms and strain. Phonation effort improved by approximately 50%. VHI-9i classification was reduced from moderate to mild after treatment, and CPIB from moderate to mild. The VHI-9i results were significantly positively correlated with the CPIB results. The results of this registry showed that BoNT injection, the current off-label treatment of SD symptoms, failed to restore normal voice quality. This registry's outcomes also indicate that the choice of the applied BoNT brand is site-specific and does not appear to be associated with differences in efficacy. We also showed that the effects of BoNT on SD symptoms are better described by semi-quantitative outcome measures, such as spasm counts and voice strain, and by patient-reported outcome measures (PROMs), such as CPIB and VHI-9i, than by more objective parameters such as frequency and dynamic range of the singing voice.
- Research Article
- 10.24191/gading.v29i1.701
- Jan 31, 2026
- Gading Journal for the Social Sciences (e-ISSN 2600-7568)
- Che Nooryohana Zulkifli + 4 more
This mixed-methods classroom study evaluates LINK-IT, a tabletop board game for practising English transition signals, focusing on learner engagement, design usability, and refinement priorities. Forty students played the game in small groups during regular class time and subsequently completed a brief post-use questionnaire containing Likert-scale items and two open-ended prompts. Quantitative responses indicated strong motivation, enjoyment, and sustained attention during play, while perceptions of rule clarity and overall usability were positive. Qualitative or thematic analysis of students’ written comments reinforced these results, highlighting novelty, social interaction, and suspense as key drivers of engagement. Students also identified areas for enhancing learning value, including clearer onboarding, more transparent mechanics, and a more deliberate progression of challenge. The discussion integrates both data strands to propose a practical refinement plan that introduces a short demo round and quick-start card to reduce cognitive load, a staged bank of items to align difficulty with learner readiness, light decision-making elements to reward knowledge over chance, and brief “apply and justify” prompts to encourage transfer from recognition to production. Although limited to a single cohort and focused on post-use perceptions rather than performance outcomes, the findings suggest that LINK-IT provides a low-tech, high-interaction complement to writing instruction on cohesion, with clear opportunities for iterative improvement.
- Research Article
- 10.12688/f1000research.132052.2
- Jan 13, 2026
- F1000Research
- Astrid Dahlgren + 10 more
Background: Every day we are faced with different treatment claims, in the news, in social media, and by our family and friends. Some of these claims are true, but many are unsubstantiated. Without being supported by reliable evidence, such guidance can lead to waste and harmful health choices. The Informed Health Choices (IHC) Network facilitates development of interventions for teaching children and adults the ability to assess treatment claims (informedhealthchoices.org). Our objective was to develop and evaluate a new assessment tool developed from the item bank for use in an upcoming trial of lower secondary school resources in Uganda, Kenya, and Rwanda. Methods: We conducted a cross-sectional study evaluating a questionnaire comprising two item-sets. The first evaluated ability using multiple-choice questions (scored dichotomously) and the other evaluated intended behaviour and self-efficacy (measured using Likert scales). This study was conducted in Uganda, Kenya, and Rwanda in 2021. We recruited children (over 12 years old) and adults through schools and our networks. We entered 1,671 responses into our analysis. Summary and individual fit to the Rasch model (including Cronbach’s Alpha) were assessed using the RUMM2030 software. Results: Both item-sets were found to have good fit to the Rasch model and were acceptable to our target audience. The reliability was good (Cronbach’s alpha >0.7). Observations of the individual item and person fit provided us with guidance on how we could improve the design, scoring, and administration of the two item-sets. There was no local dependency in either of the item-sets, and both item-sets were found to have acceptable unidimensionality. Conclusion: Overall, the two item-sets were found to have satisfactory measurement properties. Based on our analysis, we consider these instruments to be suitable for our target audiences in Uganda, Kenya and Rwanda.
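The reliability threshold quoted above (Cronbach's alpha > 0.7) refers to a statistic that can be computed directly from a persons-by-items score matrix. The sketch below uses the textbook formula with sample variances; it is an illustration only, not the RUMM2030 procedure used in the study, and the function name and the small score matrix are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-person rows of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])                                   # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])   # variance of sum scores
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical dichotomous responses: 5 persons x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

Alpha rises when items covary strongly relative to their individual variances, which is why it is read as an internal-consistency index.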
- Research Article
- 10.1002/mdc3.70507
- Jan 7, 2026
- Movement disorders clinical practice
- Stephanie M Simone + 2 more
Objective decline in communication abilities following Deep Brain Stimulation (DBS) in People with Parkinson's disease (PwP) is common; however, patient perspectives remain under-investigated. This study examined subjective change in communicative efficacy, measured with the Communicative Participation Item Bank (CPIB), in 30 PwP following subthalamic DBS using t-tests, and examined correlates with cognition. Statistically significant declines in subjective communication efficacy were reported following DBS, t(29) = -3.19, P = 0.003. CPIB decline was not associated with baseline or change in objective language performance after DBS. CPIB decline was not related to laterality of DBS target or pre-DBS cognitive status. Subjective communication decline is common in PwP who undergo STN-DBS, but was not associated with objective measures of language and cognition. Findings support incorporating patient-reported outcomes (eg, CPIB) into DBS evaluations to inform intervention and clarify risk factors for communication decline.
- Research Article
- 10.1177/01466216251415011
- Jan 7, 2026
- Applied psychological measurement
- Yale Quan + 1 more
Educational constructs are becoming increasingly complex and are often conceptualized at both a general level and a subdomain level. It is often desirable to report scores from both levels simultaneously. However, to measure such complex constructs, a very large item bank that is hard for a student to complete in any reasonable timeframe is needed. Furthermore, most current score reporting practices either only report subdomain scores, or the general domain score is calculated post hoc. We propose that a multiple group higher-order IRT (HO-IRT) model with structural missingness can be used to simultaneously report general and subdomain scores while controlling assessment length. Although the model itself is not new, we consider a novel application scenario using a non-equivalent groups with anchor test (NEAT) design with both a representative and non-representative anchor test. While a representative anchor test is recommended in the literature, it is sometimes unrealistic in practice when the multidimensional construct shifts over time. Hence, exploring the parameter recovery of multiple group HO-IRT in the presence of a non-representative anchor test is especially interesting and important. We show, through Monte Carlo simulation, that the RMSE of IRT estimates retrieved under a non-representative anchor item set with a moderate correlation between the higher- and lower-order factors is comparable to the RMSE of IRT estimates retrieved under a representative anchor item set. Missing data were addressed using a full-information maximum likelihood approach to parameter estimation.
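Parameter recovery in a Monte Carlo study of this kind is usually summarized by the root mean squared error between the estimates from each replication and the generating value. The sketch below shows only that summary step; the function name, the true parameter value, and the two lists of replication estimates are hypothetical numbers for illustration, not results from the paper.

```python
import math


def rmse(estimates, true_value):
    """Root mean squared error of replication estimates around the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


# Hypothetical estimates of one item difficulty (true value 0.50) from five
# replications under each anchor condition.
true_b = 0.50
representative_anchor = [0.48, 0.55, 0.46, 0.52, 0.51]
non_representative_anchor = [0.45, 0.57, 0.44, 0.54, 0.53]

print(f"representative:     {rmse(representative_anchor, true_b):.4f}")
print(f"non-representative: {rmse(non_representative_anchor, true_b):.4f}")
```

Comparing the two values across conditions is what supports a statement that recovery is "comparable"; how close counts as comparable is a judgment the abstract does not quantify.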
- Research Article
- 10.3389/fnbeh.2025.1735237
- Jan 6, 2026
- Frontiers in Behavioral Neuroscience
- Nuno Silva Gonçalves + 2 more
Background: Effective feedback in the cognitive domain is essential for surgical education but often limited by resource constraints and traditional assessment formats. Artificial Intelligence (AI) has emerged as a catalyst for innovation, enabling automated feedback, real-time cognitive diagnostics, and scalable item generation, thereby transforming how future surgeons learn and are assessed. Methods: An item bank of 150 multiple-choice questions was developed using AI-assisted item generation and difficulty estimation. A formative Computerized Adaptive Testing (CAT), balanced across three cognitive domains (memory, analysis, and decision) and surgical topics, was delivered via QuizOne® 3–5 days before the summative Progress Test. A total of 147 students participated, of whom 116 completed the formative CAT. Performance correlations, group comparisons, analysis of covariance (ANCOVA), and regression analyses were conducted. Results: Students who voluntarily completed CAT showed higher Progress Test scores (p = 0.021), though causality cannot be established due to self-selection bias; the effect persisted after adjusting for prior academic performance (ANCOVA p = 0.041). Memory skills were the strongest predictors of summative outcomes (R² = 0.180, β = 0.425), followed by analysis (R² = 0.080, β = 0.283); decision was not significant (R² = 0.029, β = 0.170). Conclusion: AI-enhanced CAT–Cognitive Diagnostic Modeling (CDM) represents a promising formative approach in undergraduate surgical education, being associated with higher summative performance and providing individualized diagnostic feedback. Refining feedback presentation and enhancing decision-making assessment could further optimize its educational impact.
- Research Article
- 10.1080/21678421.2025.2598433
- Dec 13, 2025
- Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration
- Abigail E Haenssler + 11 more
Objective: Bulbar dysfunction often diminishes the accuracy and speed of the tongue, lip, and jaw movements necessary for speech production. Vowel acoustic features derived from speech recordings can serve as sensitive markers of articulatory accuracy and movement timing. We examined whether degraded speech caused by amyotrophic lateral sclerosis (ALS), assessed through vowel acoustic features, was associated with communicative participation restrictions. As a secondary aim, we assessed the association of two global speech characteristics, rate and intelligibility, with vowel features and communicative participation. Materials & Methods: Thirty-three people with ALS (plwALS) recorded a reading passage and completed surveys using a smartphone application. Speaking rate and acoustic vowel features (duration, vowel articulation index [VAI]) were extracted from the recordings. Three speech-language pathologists rated speech intelligibility. Communicative participation was assessed using the Communicative Participation Item Bank (CPIB) short form. Bivariate correlation, partial correlation, and regression analyses were used to evaluate the associations between vowel features, intelligibility, speaking rate, and CPIB scores. Results: Significant bivariate correlations, ranging from rs = −0.39 to rs = 0.64, were found between speech variables and CPIB scores. A combined regression model including VAI, vowel duration, and sex explained 52% of the variance in CPIB scores. Including speaking rate or intelligibility in the partial correlation analysis attenuated the associations between vowel acoustics and CPIB. Conclusions: Vowel features and global dysarthria characteristics are linked to communicative participation in ALS. Clinical practices designed to target vowel production, speaking rate, and intelligibility may help to maintain daily communication in ALS.
- Research Article
- 10.1093/rheumatology/keaf649
- Dec 12, 2025
- Rheumatology (Oxford, England)
- Antonin Satrin + 38 more
Patient education is increasingly acknowledged as an important aspect of the management of systemic lupus erythematosus (SLE). The aim of the study was to develop the SLE Knowledge Assessment score (SLAKE), a digital multilingual self-assessment tool designed to quantify essential SLE knowledge. International healthcare professionals (HCPs) and patient representatives engaged in a multi-step process to: identify essential SLE knowledge domains, select key domains via rating, and generate an item bank of 394 questions across 11 domains, which was then adapted into 19 languages. For validation, participants completed 44 questions (including 33 randomly selected), with scores calculated for total knowledge and the 11 specific domains. Statistical analyses examined associations between scores and demographic, clinical, and educational variables. SLAKE was used by 1182 SLE participants (1120 [94.8%] women, median age: 45 years [IQR: 35-54 years]), with a median SLE duration of 10 years (IQR: 4-20 years). The median SLAKE score was 37 (IQR: 34-40) out of a maximum of 44 points, while the median score across the 11 SLAKE domains ranged between 3 and 4 out of a maximum of 4 points. There was a significant positive association between SLAKE score and SLE duration (p = 0.006), previous participation in a patient education course or patient training for lupus (p < 0.0001) and education level (p < 0.0001), but not with age (p = 0.48) or gender (p = 0.39). SLAKE is a valid, multilingual, digital self-assessment tool that effectively measures essential SLE knowledge. Its randomized question bank and domain-specific scoring enable targeted education, ultimately supporting better disease management.
- Research Article
- 10.1002/mdc3.70305
- Dec 12, 2025
- Movement disorders clinical practice
- Pablo Rábano-Suárez + 19 more
There is a need to better evaluate the impact of Parkinson's disease (PD) on the daily lives of People with Parkinson's (PwP). This work aimed to conceptualize a novel digital tool, the MDS PD e-Diary, which has been commissioned by the International Parkinson's disease and Movement Disorders Society. Using a modified Delphi methodology, we sought consensus among key stakeholders (PwP, care partners, PD specialists, industry, regulatory representatives) through online questionnaires, focus groups, and a broad community survey. The consensus resulted in a multiplatform patient-reported outcome tool to track PD progression. It includes an Item Bank of symptoms and activities featuring two interconnected user modes. The personal mode is a customizable self-tracking tool that allows data sharing with professionals to improve standard care. The research mode employs a predefined responsive item to enhance research and clinical trials. The MDS PD e-Diary was designed to capture PD progression and its impact on the lives of PwP, potentially transforming research and clinical practice. Its further development and validation processes are underway.
- Research Article
- 10.1136/bmjopen-2024-098050
- Dec 2, 2025
- BMJ Open
- Ryan Statton + 4 more
Objectives: This study aims to develop a robust, targeted measure of patient experiences of person-centred care (PCC), informed by the lived experiences of patients with chronic illness using the psychometric theory of Rasch measurement. Design: The Rasch measurement model was used to analyse the psychometric functioning of 57 candidate items and select appropriate items for a targeted measure. Setting: Participants were recruited from Prolific.com, having experience of both chronic or long-term illness and first-hand experience of primary or secondary care in the UK healthcare setting, and completed a survey containing PCC items and descriptors of healthcare experience. Participants: Data from 501 adult persons (49.5% men and 49.7% women) with different types of long-term conditions recruited from the Prolific web panel. Results: For an initial analysis of all 57 candidate items, there were several indicators of misfit, such as signs of local dependence and multidimensionality. The response options worked as intended according to threshold ordering. After removal of misfitting items and refinement for the best spread of locations, a 14-item solution showed good fit to the Rasch model in this UK sample. Conclusions: The results support a unidimensional measurement of patients’ experiences of PCC, once the local dependency was accommodated. The present work thus offers a 14-item measure of PCC experience. The present work also contains a robust item bank for the further development of dynamic computerised adaptive testing.
- Research Article
- 10.1002/alz70857_107239
- Dec 1, 2025
- Alzheimer's & dementia : the journal of the Alzheimer's Association
- Richard C Gershon + 5 more
Despite Spanish being the second most spoken language in the U.S., few cognitive assessments are available in Spanish for large-scale studies with older adults. Culturally inclusive measures that promote early diagnosis and identify modifiable risk factors for Alzheimer's Disease and ADRD are of utmost importance. The Mobile Toolbox (MTB) provides brief, sensitive measures for assessing neurological and behavioral functions across the adult lifespan, aiding large-scale studies on cognitive functioning and the development of ADRD. MTB integrates with the REDCap system and MyCap Mobile App, used by over 7,600 institutions in 160 countries for remote study management and delivery. The English MTB tests, released in 2024, are valid and reliable across diverse samples. In January 2025, Spanish cognitive measures were added. In this presentation, we introduce the new Spanish tests, present the Spanish Word Meaning calibration study results, and demonstrate platform usage. Seven of the eight English versions of the MTB cognitive measures were deemed suitable for adaptation and developed into Spanish using a team of native Spanish speakers. When appropriate, test stimuli were updated to be more culturally sensitive. A calibration study with 1,620 Spanish-speaking adults (Mean Age=43.79, SD=14.45) refined the Spanish Word Meaning item bank, crucial for developing an accurate computer adaptive test (CAT) for Spanish vocabulary. The MTB library includes Spanish cognitive tests assessing language (Word Meaning), executive functioning (Arrow Matching; Shape-Color Sorting), associative memory (Faces and Names), episodic memory (Arranging Pictures), working memory (Sequences) and processing speed (Numbers Symbol Match). The Spanish Word Meaning test used the Rasch model for consistency with its English counterpart, after removing 88 poorly fitting items. The final pool contained 515 items, with difficulty parameters ranging from -1.896 to 2.074. 
The MTB addresses various scientific, practical, and technical challenges in cognitive assessment by leveraging advances in technology, measurement, and cognitive research. It is suitable for a wide range of studies, including large-scale research, clinical research, and pharmaceutical studies, particularly those interested in incorporating point-in-time and burst designs, as well as ecological momentary assessment (EMA). By offering tests in English and Spanish, MTB can support research with diverse populations.
- Research Article
- 10.2196/76544
- Nov 26, 2025
- JMIR Formative Research
- Kenneth Mcclure + 5 more
Background: Intensive longitudinal designs support temporally granular study of processes, making methods like ecological momentary assessment (EMA) increasingly common in medical and behavioral science. However, the repetitive and intensive measurement strategies associated with these designs increase participant burden, which limits the breadth and precision of EMA surveys. This is particularly problematic for complex clinical phenomena, such as suicide risk, which research has shown is multidimensional and fluctuates over narrow time intervals (eg, hours). To overcome this limitation, we proposed the Computerized Adaptive Test for Suicide Risk Pathways (CAT-SRP), which supports the simultaneous assessment of multiple empirically informed risk domains and facilitates personalized survey content. Objective: The objective of this study is to develop, calibrate, and pilot the first multidimensional computerized adaptive test for suicidal thoughts and related psychosocial risk factors in intensive longitudinal designs like EMA. Methods: A web-based assessment platform was developed to adaptively administer the CAT-SRP. CAT-SRP items were modified from existing validated instruments to support administration in intensive longitudinal designs. The item bank was developed in line with major ideation-to-action theories of suicide and consultation with experts outside the study team. Exploratory item factor analysis was used to identify dimensionality of the item bank. Item parameters were calibrated using a multidimensional graded response model in a large cross-sectional community sample (n=1759, 36.33% with a history of suicidal thoughts). Following calibration, the CAT-SRP was evaluated in an EMA study of participants with a past month history of suicidal thoughts (n=29 across 2134 observations). Adaptive testing used D-optimal item selection, a dual variable-length stopping criterion, and Maximum a Posteriori (MAP) scoring. Descriptive statistics and mixed effects models were used to examine CAT-SRP performance (eg, efficiency and survey overlap) and relationships among CAT-SRP domain scores. Results: The calibration study identified 2 suicidal thought domains (active and passive thoughts) and 12 risk factor domains: humiliation, loneliness, anger, pain, defeat, impulsivity (ie, negative urgency), entrapment, distress tolerance, perceived burdensomeness, thwarted belongingness, aggression, and a positively valenced method factor. Domain information was the highest between average to high levels of domain scores. Study 2 showed that the CAT-SRP (1) administered surveys with low to moderate item overlap, (2) incurred low participant burden, and (3) may improve near-term prediction of suicidal thoughts relative to traditional EMA measurement. Most EMA surveys reached the maximum length, 50 questions, highlighting a need to refine selection and stopping rules. Conclusions: The CAT-SRP effectively personalized EMA survey content to respondents, which reduces the repetitiveness and perceived burden of intensive longitudinal research designs. Continuous domain scores from multidimensional computerized adaptive testing (MCAT) also provided more nuanced measurement compared to traditional approaches that struggle with zero-inflation in EMA and appeared to produce stronger predictive relationships. Overall, the CAT-SRP demonstrated strong methodological advantages of using CAT for intensive longitudinal data collection.
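Maximum a Posteriori (MAP) scoring, mentioned in the Methods above, picks the trait value that maximizes the likelihood of the observed responses multiplied by a prior. The sketch below is a deliberately simplified illustration: it assumes a unidimensional two-parameter logistic model with dichotomous items, a standard normal prior, and a grid search, whereas the study used a multidimensional graded response model. The function names and item parameters are hypothetical.

```python
import math


def log_posterior(theta, responses, items):
    """Log posterior (up to a constant) under a 2PL model with a N(0, 1) prior.

    items is a list of (discrimination a, difficulty b) pairs; responses holds
    1 for an endorsed/correct item and 0 otherwise.
    """
    logp = -0.5 * theta ** 2                      # standard normal prior
    for u, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        logp += math.log(p) if u == 1 else math.log(1.0 - p)
    return logp


def map_estimate(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search MAP estimate of theta on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_posterior(t, responses, items))


# Hypothetical item parameters and one respondent's answers.
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
responses = [1, 1, 0, 0]
print(f"MAP theta = {map_estimate(responses, items):.2f}")
```

Because the prior pulls estimates toward zero, MAP scores stay finite even for all-endorsed or all-unendorsed response patterns, which is convenient when an adaptive survey has administered only a few items.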
- Research Article
- 10.35329/fkip.v21i2.6040
- Nov 16, 2025
- Pepatudzu : Media Pendidikan dan Sosial Kemasyarakatan
- Ulya Nur Alim + 4 more
Evaluation in Arabic language learning is essential to measure students' achievement; however, the quality of tests used in Indonesia still requires improvement. This study employed the Systematic Literature Review (SLR) method to analyze the validity, reliability, difficulty level, and discrimination power of Arabic test items, based on a synthesis of six articles indexed in SINTA and Scopus, published between 2019 and 2024. This SLR approach offers a new contribution by systematically revealing national trends and gaps in item quality, which have not been comprehensively analyzed in previous studies. The findings show that on average, 66% of the items were valid, and most tests demonstrated very high reliability (≥ 0.85), although some tests had low reliability (0.54). The distribution of difficulty levels was imbalanced, with 50.83% of items being too easy and only 7.67% classified as difficult, deviating from the ideal distribution. Additionally, 34% of the items exhibited low discrimination power, reducing the effectiveness of assessments in distinguishing students' abilities. These imbalances can lead to biased evaluations and hinder students' competency development. The practical implications of this study include the importance of teacher training in item analysis, the application of Bloom's Taxonomy to balance item difficulty levels, and the development of a standardized, data-driven item bank. The main contribution of this research is to provide empirical foundations for improving Arabic language assessment policies in Indonesia and to propose a more accurate and fair evidence-based evaluation approach.
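The difficulty and discrimination indices discussed in the abstract above can be sketched with classical item analysis. The example below assumes dichotomously scored items and the common upper-lower group method; the reviewed studies may have used other procedures, and the function name, the `group_fraction` parameter, and the sample data are hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """Classical difficulty and discrimination indices.

    responses: one list of 0/1 item scores per examinee.
    Returns a (difficulty, discrimination) tuple per item.
    """
    n_items = len(responses[0])
    # Rank examinees by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    g = max(1, round(len(ranked) * group_fraction))
    upper, lower = ranked[:g], ranked[-g:]
    results = []
    for j in range(n_items):
        # Difficulty: proportion of all examinees answering correctly.
        difficulty = sum(r[j] for r in responses) / len(responses)
        # Discrimination: upper-group minus lower-group proportion correct.
        discrimination = (sum(r[j] for r in upper) - sum(r[j] for r in lower)) / g
        results.append((difficulty, discrimination))
    return results


# Hypothetical scores: four examinees, three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
stats = item_analysis(scores, group_fraction=0.5)
```

A difficulty value near 1 marks an easy item, and a discrimination value near 0 marks an item that does little to separate stronger from weaker examinees.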
- Research Article
- 10.58578/ajstea.v3i6.7603
- Nov 15, 2025
- Asian Journal of Science, Technology, Engineering, and Art
- Anthony A Arogbofa
Childhood trauma is a well-established risk factor for cognitive impairments; however, its impact on Nigerian adolescents remains underexplored. This study examined the relationship between childhood trauma and cognitive functioning among in-school adolescents in Gwagwalada, Nigeria. Employing a cross-sectional design, 240 students (120 males, 120 females; M = 14.5 years, SD = 1.47) were randomly selected from four secondary schools. Trauma exposure was measured using the Childhood Trauma Questionnaire (CTQ-28), while cognitive functioning—specifically memory, attention, and executive function—was assessed using the PROMIS Pediatric Item Bank v1.0. Results revealed that 75% of participants reported experiencing at least one form of trauma, with emotional abuse being the most prevalent (45%). Correlation and regression analyses showed a significant negative association between trauma exposure and cognitive functioning (r = –0.42, p < 0.01), with emotional abuse emerging as the strongest predictor (β = –0.35, p < 0.01). ANOVA results confirmed significant group differences, indicating that emotional abuse predicted the most severe cognitive deficits (F(4, 235) = 9.12, p < 0.01). Gender emerged as a moderating variable, with female adolescents exhibiting heightened vulnerability to trauma-related cognitive impairments. These findings underscore the critical need for trauma-informed educational and clinical interventions tailored to adolescent populations in Nigeria. The study recommends further research into the neurobiological pathways of trauma and the development of culturally responsive support systems for affected youth.
- Research Article
- 10.54380/ijrdet1125_69
- Nov 12, 2025
- International Journal of Recent Development in Engineering and Technology
- V V Subrahmanyam + 1 more
Generative Artificial Intelligence (GenAI) is introducing advanced computational techniques to educational assessments, enabling scalable, adaptive, and context-aware evaluation mechanisms. Unlike conventional assessment systems, which rely on static item banks, GenAI leverages deep learning architectures, particularly transformer-based language models (e.g., GPT, LLaMA, Claude), to generate contextually relevant questions, model diverse difficulty levels, and simulate authentic problem-solving scenarios. Automated question generation, distractor design, and rubric-based grading have been enhanced using Natural Language Processing (NLP) and semantic similarity algorithms, significantly reducing human intervention in assessment design. Several AI-driven assessment tools are gaining traction in academic and corporate learning environments. Gradescope and EvalAI use machine learning for automated grading and plagiarism detection; QuestionWell and Quillionz employ generative models to create test items and quizzes; while Inspera Assessment integrates AI-powered analytics for adaptive testing. Emerging systems such as ChatGPT-based tutors and Otter.ai for real-time transcription are also being explored for formative assessment and feedback generation. However, critical technical challenges persist, including maintaining reliability, explainability of scoring algorithms, mitigation of bias in generated content, and ensuring secure data handling. This paper discusses the underlying GenAI architectures, reviews contemporary AI assessment tools, and outlines future research directions for developing transparent, ethically aligned, and pedagogically sound AI-driven assessment systems.
- Research Article
- 10.24143/2072-9502-2025-4-122-130
- Nov 10, 2025
- Vestnik of Astrakhan State Technical University. Series: Management, computer science and informatics
- Elena Lvovna Medyankina
The article discusses the theoretical basis of automated testing, highlighting various approaches to conducting tests using special software tools and their impact on assessment results. The focus has shifted from creating a large bank of test items to individualizing testing for each student. Given that advancements in machine learning have led to significant changes in adaptive testing, there is an opportunity to apply modern methods to create tests that are tailored to the specific needs and knowledge levels of each student. The article also explores the process of selecting a suitable platform for implementing an adaptive model. This process includes defining key requirements, creating a job database structure, and developing a testing system. The article presents the structure of the model, with a description of all its components, and the adaptive testing algorithm, which is based on the grading of tasks by levels of learning, reflecting the depth of understanding, developed by educational theorists. This approach aims to improve the accuracy of assessing students' knowledge by individualizing the testing process. As an example, the article provides tasks based on this grading, which can be modified and applied to various academic disciplines, as well as a pseudocode example that can be adapted to the chosen programming language and task requirements. Thus, the implementation of the results of this study will have a positive impact on the improvement and optimization of the entire educational process.
- Research Article
- 10.3389/feduc.2025.1639273
- Nov 6, 2025
- Frontiers in Education
- Yuanyuan Wang + 2 more
Introduction Educators need real-time evidence of how students process pre-class quiz items in flipped courses, not just whether answers are right or wrong. We examined whether two classroom-feasible eye-tracking metrics—fixation intensity (total dwell time) and regression rate (proportion of backward saccades)—provide interpretable, item-level signals of cognitive engagement once surface text features are taken into account. Methods Thirty-four undergraduates completed 320 analysable attempts on 55 multiple-choice items coded by Bloom’s taxonomy while a 60 Hz tracker recorded gaze. Crossed mixed-effects models included a covariate for each item’s total word count. A logistic mixed model tested whether fixation intensity and regression rate predicted correctness beyond Bloom level, gender, and length. After each block, students reported perceived mental effort to compare subjective and gaze-based indicators. Results After controlling for total word count, Bloom category did not uniquely predict fixation intensity or regression rate, suggesting that previously observed demand patterns largely reflected text length. In the accuracy model, fixation intensity showed a small, positive association with being correct, whereas regression rate showed a small, negative association. Discussion In authentic flipped-class quizzes, fixation intensity and regression rate can serve as complementary, real-time indicators of engagement, but only when item length and layout are standardised or statistically modelled. Claims about differences across Bloom levels should be made cautiously. We outline design guidance for future item banks—length-matched stems, fixed numbers of options, and pre-registered word-count covariates—to enable firmer inferences and practical classroom diagnostics.
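The two gaze metrics defined in the abstract above (total dwell time and proportion of backward saccades) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a single line of left-to-right text, so a saccade counts as backward when the horizontal position decreases. The function name and the fixation data are hypothetical.

```python
def gaze_metrics(fixations):
    """Compute fixation intensity and regression rate for one attempt.

    fixations: list of (x_position_px, duration_ms) in temporal order.
    Assumes single-line, left-to-right text (illustrative simplification).
    """
    # Fixation intensity: total dwell time across all fixations.
    total_dwell_ms = sum(duration for _, duration in fixations)
    # Horizontal displacement of each saccade between consecutive fixations.
    moves = [b[0] - a[0] for a, b in zip(fixations, fixations[1:])]
    backward = sum(1 for dx in moves if dx < 0)
    # Regression rate: share of saccades that move backward.
    regression_rate = backward / len(moves) if moves else 0.0
    return total_dwell_ms, regression_rate


# Hypothetical attempt with four fixations and one backward saccade.
dwell, rate = gaze_metrics([(100, 200), (180, 150), (140, 100), (260, 250)])
```

Real quiz items span several lines, so a production version would need line-aware logic to avoid counting return sweeps as regressions.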