
Developing NOVA – A Next-Generation Open Vocabulary Assessment


Abstract: In psychological assessment, vocabulary tests are commonly used as reliable and efficient indicators of crystallized intelligence, as retrospective proxies for premorbid intelligence, and as measures of language proficiency. However, many of the widely used German vocabulary tests are outdated, proprietary, and lack a clear rationale for item selection. To address these limitations, we developed a new, openly available vocabulary test: the Next-Generation Open Vocabulary Assessment (NOVA). We constructed 110 multiple-choice vocabulary items with support from ChatGPT and administered them to 1,052 German-speaking adults using a multiple-matrix design, along with a declarative knowledge test for validation purposes. Using Ant Colony Optimization, we assembled two parallel 30-item short forms by optimizing reliability as well as item difficulty and discrimination parameters within a three-parameter logistic item response model. The resulting test forms provided unidimensional and reliable measurement, covered a broad ability range, showed no gender differences, and correlated strongly with declarative knowledge. A Shiny app is available to calculate norm-referenced scores based on individual test results. Additional analyses revealed that 61.2% of the variance in item difficulty was explained by word frequency and word length, underscoring their potential utility in guiding future vocabulary test development.
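
The three-parameter logistic (3PL) model named in the abstract can be illustrated with a minimal sketch. The function name and the example item parameters below are assumptions chosen for illustration; they are not taken from the NOVA item bank.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: person ability
    a: item discrimination
    b: item difficulty
    c: lower asymptote (pseudo-guessing parameter)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item (values invented for illustration only).
a, b, c = 1.2, 0.5, 0.2
for theta in (-2.0, 0.5, 3.0):
    print(theta, round(p_correct_3pl(theta, a, b, c), 3))
```

When ability equals difficulty (theta = b), the probability is halfway between the guessing floor c and 1, which is why b is read as the item's location on the ability scale.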

Similar Papers
  • Preprint Article
  • Citations: 1
  • 10.31234/osf.io/vhakw_v1
Developing NOVA: A Next-Generation Open Vocabulary Assessment
  • Feb 6, 2025
  • Ulrich Schroeders + 1 more

In psychological assessment, vocabulary tests are commonly used as reliable and efficient indicators of crystallized intelligence, as retrospective proxies for premorbid intelligence, and as measures of language proficiency. However, many of the widely used German vocabulary tests are outdated, proprietary, and lack a clear rationale for item selection. To address these limitations, we developed a new, openly available vocabulary test: the Next-Generation Open Vocabulary Assessment (NOVA). Therefore, we first constructed 110 multiple-choice vocabulary items with support from ChatGPT and administered them to 1,052 German-speaking adults using a multiple-matrix design, along with a declarative knowledge test for validation purposes. In a second step, we used Ant Colony Optimization to compile two parallel 30-item short forms, optimized for reliability and item difficulty and discrimination. The resulting tests assessed vocabulary unidimensionally and reliably, covered a large ability range, and correlated strongly with declarative knowledge. We provide a Shiny app for the calculation of standard values based on individual test results. Additional analyses revealed that 57% of the variance in item difficulties could be explained by word frequency and word length, which may be particularly useful for streamlining the future development of vocabulary tests.

  • Research Article
  • Citations: 3
  • 10.3758/s13428-022-01918-0
What makes domain knowledge difficult? Word usage frequency from SUBTLEX and dlexDB explains knowledge item difficulty
  • Aug 1, 2022
  • Behavior Research Methods
  • Ulrich Ludewig + 3 more

The quality of tests in psychological and educational assessment is of great scholarly and public interest. Item difficulty models are vital to generating test result interpretations based on evidence. A major determining factor of item difficulty in knowledge tests is the opportunity to learn about the facts and concepts in question. Knowledge is mainly conveyed through language. Exposure to language associated with facts and concepts might be an indicator of the opportunity to learn. Thus, we hypothesize that item difficulty in knowledge tests should be related to the probability of exposure to the item content in everyday life and/or academic settings and therefore also to word frequency. Results from a study with 99 political knowledge test items administered to N = 250 German seventh (age: 11–14 years) and tenth (age: 15–18 years) graders showed that word frequencies in everyday settings (SUBTLEX-DE) explain variance in item difficulty, while word frequencies in academic settings (dlexDB) alone do not. However, both types of word frequency combined explain a considerable amount of the variance in item difficulty. Items with words that are more frequent in both settings and, in particular, relatively frequent in everyday settings are easier. High word frequencies and relatively higher word frequency in everyday settings could be associated with higher probability of exposure, conceptual complexity, and better readability of item content. Examining word frequency from different language settings can help researchers investigate test score interpretations and is a useful tool for predicting item difficulty and refining knowledge test items.

  • Research Article
  • 10.1177/01466216251316276
Compound Optimal Design for Online Item Calibration Under the Two-Parameter Logistic Model.
  • Jan 28, 2025
  • Applied Psychological Measurement
  • Lihong Song + 1 more

Under the theory of sequential design, compound optimal design with two optimality criteria can be used to solve the problem of efficient calibration of item parameters of item response theory models. In order to efficiently calibrate item parameters in computerized testing, a compound optimal design is proposed for the simultaneous estimation of item difficulty and discrimination parameters under the two-parameter logistic model, which adaptively focuses on optimizing the parameter that is difficult to estimate. The compound optimal design using the acceptance probability can provide ability design points to optimize the item difficulty and discrimination parameters, respectively. Simulation and real data analysis studies showed that the compound optimal design outperformed the D-optimal and random designs in terms of the recovery of both discrimination and difficulty parameters.
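
As background for the D-optimal baseline mentioned in this abstract, the sketch below computes the Fisher information matrix for the discrimination and difficulty parameters (a, b) of a two-parameter logistic item and takes its determinant as the D-criterion. This is a textbook illustration only, not the compound design proposed by the authors; the function names and parameter values are assumptions.

```python
import numpy as np


def item_info_matrix(theta, a, b):
    """Fisher information for (a, b) of a 2PL item, summed over abilities theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL response probability
    w = p * (1.0 - p)                           # Bernoulli variance at each theta
    d = theta - b
    off_diag = np.sum(-a * w * d)
    return np.array([
        [np.sum(w * d**2), off_diag],
        [off_diag, np.sum(a**2 * w)],
    ])


def d_criterion(theta, a, b):
    """Determinant of the information matrix (larger is better for D-optimality)."""
    return float(np.linalg.det(item_info_matrix(theta, a, b)))


# Hypothetical item and two candidate sets of examinee abilities.
a, b = 1.5, 0.0
print(d_criterion([0.0], a, b))        # a single ability point cannot identify both parameters
print(d_criterion([-1.0, 1.0], a, b))  # two points on either side of b carry joint information
```

A single design point yields a rank-one matrix, so at least two distinct ability levels are needed before both parameters can be estimated together.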

  • Research Article
  • Citations: 73
  • 10.1177/0146621603256021
Variance Estimation for Converting MIMIC Model Parameters to IRT Parameters in DIF Analysis
  • Sep 1, 2003
  • Applied Psychological Measurement
  • Randall Macintosh + 1 more

The purpose of this study is to document the delta method to compute the standard error of the estimates of the converted item response theory (IRT) discrimination and difficulty parameters derived from multiple-indicator, multiple-causes (MIMIC) model parameters. Discussed is the formulation of MIMIC models to explore differential item functioning in Mplus and how to obtain factor-analytic estimates that are converted easily into IRT parameters. Also described are the partial derivatives necessary to apply the delta method to estimate variances for the converted parameters. Both item difficulty and discrimination parameters estimated from MIMIC parameters were very close to the Multilog estimates. The variance estimates for most parameters were similar as well.

  • Research Article
  • Citations: 11
  • 10.1111/j.1745-3984.1991.tb00360.x
Use of Restricted Item Response Models for Examining Item Difficulty Ordering and Slope Uniformity
  • Dec 1, 1991
  • Journal of Educational Measurement
  • Suzanne Lane

This article demonstrates the utility of restricted item response models for examining item difficulty ordering and slope uniformity for an item set that reflects varying cognitive processes. Twelve sets of paired algebra word problems were developed to systematically reflect various types of cognitive processes required for successful performance. This resulted in a total of 24 items. They reflected distance‐rate–time (DRT), interest, and area problems. Hypotheses concerning difficulty ordering and slope uniformity for the items were tested by constraining item difficulty and discrimination parameters in hierarchical item response models. The first set of model comparisons tested the equality of the discrimination and difficulty parameters for each set of paired items. The second set of model comparisons examined slope uniformity within the complex DRT problems. The third set of model comparisons examined whether the familiarity of the story context affected item difficulty for two types of complex DRT problems. The last set of model comparisons tested the hypothesized difficulty ordering of the items.

  • Research Article
  • Citations: 40
  • 10.1044/2015_jslhr-l-14-0249
Item Response Theory Modeling of the Philadelphia Naming Test.
  • Mar 25, 2015
  • Journal of Speech, Language, and Hearing Research
  • Gerasimos Fergadiotis + 2 more

In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating explanatory variables to item difficulty. This article describes the statistical model underlying the computer adaptive PNT presented in a companion article (Hula, Kellough, & Fergadiotis, 2015). Using archival data, we evaluated the fit of the PNT to 1- and 2-parameter logistic models and examined the precision of the resulting parameter estimates. We regressed the item difficulty estimates on three predictor variables: word length, age of acquisition, and contextual diversity. The 2-parameter logistic model demonstrated marginally better fit, but the fit of the 1-parameter logistic model was adequate. Precision was excellent for both person ability and item difficulty estimates. Word length, age of acquisition, and contextual diversity all independently contributed to variance in item difficulty. Item-response-theory methods can be productively used to analyze and quantify anomia severity in aphasia. Regression of item difficulty on lexical variables supported the validity of the PNT and interpretation of anomia severity scores in the context of current word-finding models.
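
The regression of item difficulty on lexical predictors described above can be sketched as an ordinary least-squares fit with an R² summary. The data here are randomly generated placeholders, and the variable names and coefficients are assumptions; none of it reproduces the PNT analysis or its results.

```python
import numpy as np


def fit_r_squared(X, y):
    """Ordinary least squares with an intercept; returns (R^2, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), X])     # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares solution
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot, beta


# Synthetic placeholder data (invented, for illustration only).
rng = np.random.default_rng(0)
n_items = 100
word_length = rng.integers(3, 12, size=n_items).astype(float)
age_of_acquisition = rng.normal(6.0, 2.0, size=n_items)
contextual_diversity = rng.normal(3.0, 0.8, size=n_items)
difficulty = (
    0.15 * word_length
    + 0.30 * age_of_acquisition
    - 0.50 * contextual_diversity
    + rng.normal(0.0, 0.5, size=n_items)
)

X = np.column_stack([word_length, age_of_acquisition, contextual_diversity])
r2, beta = fit_r_squared(X, difficulty)
print(f"R^2 = {r2:.3f}")
print("intercept and slopes:", np.round(beta, 3))
```

R² here is the share of variance in item difficulty accounted for by the predictors, the same kind of quantity reported in several of the abstracts on this page.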

  • Research Article
  • 10.52589/bjeldp-4skvbgua
Assessing Item Difficulty, Discrimination, Guessing, and Carelessness Parameters of the Mathematics Achievement test for Secondary School Students in Edo State, Nigeria
  • Jul 30, 2025
  • British Journal of Education, Learning and Development Psychology
  • Afemikhe, O A + 1 more

This study assessed the psychometric properties of the Mathematics Achievement test for Secondary School Students in Edo State, Nigeria, using the four-parameter logistic model (4PLM) of Item Response Theory (IRT). The study adopted a descriptive survey design. The population comprised students from 312 public junior secondary schools in Edo State, while the sample consisted of 2,204 students selected from this population. The research instrument was a 40-item multiple-choice Mathematics Achievement test developed by Afemikhe and Imasuen (2024). The instrument, previously validated and standardized, had a reliability coefficient of 0.89 using the Kuder-Richardson Formula 20 (KR-20). Unidimensionality of the data was verified through Principal Component Analysis using SPSS, while item calibration was conducted with jMetrik IRT software to estimate item difficulty, discrimination, guessing, and carelessness parameters. The results revealed that most items demonstrated very high discrimination, indicating a strong capacity to differentiate between students with high and low levels of achievement in mathematics. Most items were difficult, suggesting that the test provided sufficient challenge for students. However, a high proportion of items displayed elevated guessing parameters, reflecting issues with distractor quality. On the positive side, carelessness was generally low, suggesting that students responded attentively. Based on the findings, it was recommended that the distractors of the test items be reviewed and improved to reduce guessing and that IRT frameworks be more widely adopted in the evaluation of educational assessments.
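
The Kuder-Richardson Formula 20 reliability coefficient cited in this abstract can be computed from a matrix of dichotomous (0/1) responses. The sketch below assumes the population variance of total scores (ddof=0), which is one common convention; the tiny response matrix is invented for illustration and the function name is hypothetical.

```python
import numpy as np


def kr20(responses):
    """KR-20 for a persons x items matrix of 0/1 scores.

    Assumes the total-score variance is nonzero and uses the
    population variance (ddof=0), one common convention.
    """
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                   # number of items
    p = responses.mean(axis=0)               # proportion correct per item
    sum_item_var = np.sum(p * (1.0 - p))     # sum of item variances p*q
    total_var = responses.sum(axis=1).var()  # variance of total scores
    return k / (k - 1) * (1.0 - sum_item_var / total_var)


# Invented toy data: 6 examinees, 4 items.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(toy), 3))
```

Values close to 1 indicate that the items rank examinees consistently; the coefficient applies only to dichotomously scored items.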

  • Research Article
  • Citations: 14
  • 10.15288/jsad.2011.72.981
Modeling the Severity of Drinking Consequences in First-Year College Women: An Item Response Theory Analysis of the Rutgers Alcohol Problem Index
  • Nov 1, 2011
  • Journal of Studies on Alcohol and Drugs
  • Amy M Cohn + 3 more

The present study examined the latent continuum of alcohol-related negative consequences among first-year college women using methods from item response theory and classical test theory. Participants (N = 315) were college women in their freshman year who reported consuming any alcohol in the past 90 days and who completed assessments of alcohol consumption and alcohol-related negative consequences using the Rutgers Alcohol Problem Index. Item response theory analyses showed poor model fit for five items identified in the Rutgers Alcohol Problem Index. Two-parameter item response theory logistic models were applied to the remaining 18 items to examine estimates of item difficulty (i.e., severity) and discrimination parameters. The item difficulty parameters ranged from 0.591 to 2.031, and the discrimination parameters ranged from 0.321 to 2.371. Classical test theory analyses indicated that the omission of the five misfit items did not significantly alter the psychometric properties of the construct. Findings suggest that those consequences that had greater severity and discrimination parameters may be used as screening items to identify female problem drinkers at risk for an alcohol use disorder.

  • Research Article
  • Citations: 69
  • 10.2174/1874350100902010094
What does the Mental Rotation Test Measure? An Analysis of Item Difficulty and Item Characteristics
  • Dec 11, 2009
  • The Open Psychology Journal
  • André F Caissie + 2 more

The present study examined the contributions of various item characteristics to the difficulty of the individual items on the Mental Rotation Test (MRT). Analyses of item difficulties from a large data set of university students were conducted to assess the role of time limitation, distractor type, occlusion, configuration type, and the degree of angular disparity. Results replicated in large part previous findings that indicated that occluded items were significantly more difficult than non-occluded and that mirror items were more difficult than structural items. An item characteristic not previously examined in the literature, configuration type (homogeneous versus heterogeneous), also was found to be associated with item difficulty. Interestingly, no significant association was found between angular disparity and difficulty. Multiple regression analysis revealed that a model consisting of occlusion and configuration type alone was sufficient for explaining 53 percent of the variance in item difficulty. No interaction between these two factors was found. It is suggested, based on overall results, that basic figure perception, identification and comparison, but not necessarily mental rotation, account for much of the variance in item difficulty on the MRT.

  • Dissertation
  • 10.15760/honors.962
The Utility of Multiplex Closeness Centrality for Predicting Item Difficulty Parameters in Anomia Tests
  • Nov 20, 2020
  • Khanh Nguyen

Background: Confrontation naming tests for the assessment of aphasia are perhaps the most commonly used tests in aphasiology. Recently, such tests have been modeled using item response theory approaches. Despite their advantages, item response theory models require large sample sizes for parameter estimation that are often unrealistic when working with clinical populations. As an alternative approach, Fergadiotis, Kellough & Hula (2015) explored automatic item calibration by regressing item difficulty parameters on word length, age of acquisition (AOA), and lexical frequency as quantified by the Log10CD index. Despite the high predictive utility that they achieved, the model’s performance was far from perfect (R² = .63), which may carry implications for the accuracy of any difficulty parameters derived by the model. Purpose: This study aims to examine the addition of a fourth psycholinguistic variable to the regression model, multiplex closeness centrality (MCC). It is hypothesized that the ability to capture how well-connected words are in the human lexicon would make MCC a potential indicator of semantic processing which would contribute to the predictive utility of the model. Method: A multiple regression analysis was carried out with the Philadelphia Naming Test item difficulty parameters as the dependent variable, and lexical frequency, AOA, word length, and MCC as the predictors. Item difficulty parameters were estimated based on a traditional calibration approach. Results & Conclusions: Our analysis showed a high correlation between MCC and item difficulty and suggested that the addition of MCC has allowed the model to account for more variance. However, the change between the model with three variables and the one with four variables, including MCC, was not statistically significant.
In other words, MCC did not add unique information to the regression model despite the high correlation with item difficulty due to the overlapping variance of MCC with other predictors. However, the findings should be interpreted cautiously because of a large number of missing values in MCC. Post hoc analyses indicated that data were missing not at random which might have contributed to the lack of significant findings. Thus, we suggest that future research investigate this type of study using a complete dataset and appropriately apply the missing data theory to their analysis.

  • Research Article
  • 10.58194/jetli.v4i2.2998
Measuring the Quality of Teacher-Constructed English Test as Final Examination through Item Response Theory
  • Jul 31, 2025
  • Journal of English Teaching and Linguistic Issues (JETLI)
  • Novri Pahrizal + 1 more

This study aimed to examine the psychometric quality of a teacher-constructed English final examination test for Grade X students at a senior high school in Sungai Penuh using the framework of Item Response Theory (IRT). The analysis focused on evaluating model fit, item difficulty, and item discrimination parameters across the 1-Parameter Logistic (1-PL) and 2-Parameter Logistic (2-PL) models. Data were collected from students’ responses to 40 multiple-choice items and analyzed using RStudio. The goodness-of-fit results revealed that the 2-PL model provided a better representation of the data, with 36 items classified as fitting and only 3 as misfitting, compared to the 1-PL model, where 32 items fit and 8 misfit. Furthermore, the difficulty parameter (b) indicated that all items were within the acceptable range (–2 ≤ b ≤ +2), with a tendency toward easy to moderate levels. The discrimination parameter (a) demonstrated that most items possessed satisfactory to high discrimination power, although a small number exhibited lower values. These findings confirm that the teacher-constructed test generally meets psychometric standards of validity and reliability, while also highlighting the need for revision of a few misfitting and low-discrimination items. The study provides both theoretical and practical contributions by emphasizing the importance of applying IRT in school-based assessment practices to ensure fair, accurate, and effective evaluation of students’ learning outcomes.

  • Research Article
  • Citations: 2
  • 10.37244/ela.2021.16.1.59
Predicting the Item Difficulty of a Simulated CSAT English Test Based on Corpus Analysis
  • Jun 30, 2021
  • The Korea English Language Testing Association
  • Hye Rang Om

This study investigates the relationship between linguistic features and item difficulty of a simulated College Scholastic Ability Test (CSAT) English test, based on a corpus analysis. The test used for the present study was a simulated CSAT English test administered in June 2020. Item difficulty data were collected from 101,386 students who took the test. For the corpus analysis, lexical and syntactic variables were measured with two computational tools, the Lexical Complexity Analyzer (LCA) and the L2 Syntactic Complexity Analyzer (L2SCA), and were correlated with item difficulty (the dependent variable) for 41 test items. According to the correlation analysis, one lexical variable and all syntactic variables were found to be significantly correlated with item difficulty. Also, the results of the multiple regression indicate that lexical sophistication and particular structures are related to item difficulty, explaining approximately 55.1% of the variance in item difficulty. The results showed that the variables identified in the current study were explanatory in terms of predicting item difficulty of the CSAT English test. Therefore, the findings of this study have pedagogical implications for test developers and education policy makers in Korea, with regard to text characteristics and test difficulty.

  • Research Article
  • Citations: 1
  • 10.3724/sp.j.1041.2013.01179
The Item Parameters’ Estimation Accuracy of Two-Parameter Logistic Model
  • Dec 13, 2013
  • Acta Psychologica Sinica
  • Wenjiu Du + 2 more

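Most [existing studies] start from known true values of the item parameters, generate new estimates with various parameter estimation methods, and then compare the estimates with the true values in terms of bias (BIAS) and root mean square error (RMSE) to demonstrate the effectiveness of the estimation method. However, this approach cannot show how estimation error changes across different true parameter values. To remedy this shortcoming, this paper studies the estimation accuracy of item parameters from the perspective of the information function for item parameter estimation. Taking the two-parameter logistic model as the object of study, we first define the estimation information function of the item parameters and then, based on a completely randomized experimental design, use a simulation study to explore the factors that affect the estimation accuracy of the item parameters; the experiment comprised 2×3×2 conditions. The results show that (1) the estimation accuracy of the item parameters (a, b) improves as the examinee sample size increases; (2) the ability distribution of the examinees has a large effect on the estimation accuracy of the difficulty parameter and a relatively small effect on that of the discrimination parameter; and (3) the estimation accuracy of both the difficulty and the discrimination parameter is jointly affected by parameters a and b.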

  • Research Article
  • Citations: 44
  • 10.1080/15305058.2001.9669470
Combining Multiple Regression and CART to Understand Difficulty in Second Language Reading and Listening Comprehension Test Items
  • Sep 1, 2001
  • International Journal of Testing
  • Andre A Rupp + 2 more

Identifying sources of item difficulty allows test developers to better define the constructs that they are testing by providing empirical evidence. Past research has explored item difficulty of reading and listening comprehension items using multiple regression and classification and regression tree (CART) analyses (e.g., Freedle & Kostin, 1993, 1996, 1999; Sheehan, 1997); however, few studies have combined these techniques so that practitioners can evaluate their relative contributions to our understanding. In this study, 214 computerized reading and listening comprehension items completed by 87 nonnative English speakers of varying ability levels were analyzed using a two-fold approach. First, item difficulty was modeled as a function of 12 text and item and text interaction predictor variables in a multiple linear regression model. Seven of the 12 variables in the model accounted for about 31% of the variance in item difficulty as measured by the adjusted R². Second, the data were used to build multiple regression tree models using CART, a nonparametric technique that uncovers linear dependencies among predictor variables. Seven variables, some of which were not identified in the regression analysis, were relatively important across all trees. We found that synthesizing results from the 2 methodological perspectives provided a richer picture of the interrelations of variables that affect item difficulty, and provided some empirical support for our construct definitions.

  • Research Article
  • Citations: 28
  • 10.1002/j.2333-8504.1982.tb01308.x
THE EFFECT OF THE POSITION OF AN ITEM WITHIN A TEST ON ITEM RESPONDING BEHAVIOR: AN ANALYSIS BASED ON ITEM RESPONSE THEORY
  • Jun 1, 1982
  • ETS Research Report Series
  • Neal M Kingston + 1 more

The research described in this paper deals solely with the effect of the position of an item within a test on examinee's responding behavior at the item level. For simplicity's sake, this effect will be referred to as practice effect when the result is improved examinee performance and as fatigue effect when the result is poorer examinee performance. Item response theory item statistics were used to assess position effects because, unlike traditional item statistics, they are sample invariant. In addition, the use of item response theory statistics allows one to make a reasonable adjustment for speededness, which is important when, as in this research, the same item administered in different positions is likely to be affected differently by speededness, depending upon its location in the test. Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted. The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items and a moderate fatigue effect for reading comprehension items. Analysis of other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed. Analysis of the difference between equatings for Form 3CGR1 reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section. Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGR1 and a weaker but still definite relationship for Form 3CGR1. The results of this research strongly suggest one particularly important implication for equating. If an item type exhibits a within‐test context effect, any equating method, e.g., IRT based equating, that uses item data either directly or as part of an equating section score should provide for administration of the items in the same position in the old and new forms. Although a within‐test context effect might have a negligible influence on a single equating, a chain of such equatings might drift because of the systematic bias.
