Students’ proficiency scores within multitrait item response theory

Abstract

In this paper we present a series of item response models of data collected using the Force Concept Inventory. The Force Concept Inventory (FCI) was designed to poll the Newtonian conception of force viewed as a multidimensional concept, that is, as a complex of distinguishable conceptual dimensions. Several previous studies have developed single-trait item response models of FCI data; however, we feel that multidimensional models are also appropriate given the explicitly multidimensional design of the inventory. The models employed in the research reported here vary in both the number of fitting parameters and the number of underlying latent traits assumed. We calculate several model information statistics to ensure adequate model fit and to determine which of the models provides the optimal balance of information and parsimony. Our analysis indicates that all item response models tested, from the single-trait Rasch model through to a model with ten latent traits, satisfy the standard requirements of fit. However, analysis of model information criteria indicates that the five-trait model is optimal. We note that an earlier factor analysis of the same FCI data also led to a five-factor model. Furthermore, the factors in our previous study and the traits identified in the current work match each other well. The optimal five-trait model assigns proficiency scores to all respondents for each of the five traits. We construct a correlation matrix between the proficiencies in each of these traits. This correlation matrix shows strong correlations between some proficiencies, and strong anticorrelations between others. We present an interpretation of this correlation matrix.
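
To make the model-comparison step concrete, here is a minimal sketch, assuming hypothetical log-likelihoods, parameter counts, and proficiency scores (none of these numbers come from the paper): it computes AIC and BIC for candidate models and then the correlation matrix between per-trait proficiencies.

```python
import numpy as np

def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: penalizes parameters more heavily."""
    return n_params * np.log(n_obs) - 2 * log_lik

# Hypothetical fit results for models with 1, 5, and 10 latent traits:
# {n_traits: (maximized log-likelihood, number of free parameters)}.
fits = {1: (-9500.0, 60), 5: (-9100.0, 180), 10: (-9050.0, 330)}
n_respondents = 500
for k, (ll, p) in fits.items():
    print(f"{k}-trait model: AIC={aic(ll, p):.1f}  BIC={bic(ll, p, n_respondents):.1f}")

# Correlation matrix between per-trait proficiency scores
# (rows = respondents, columns = traits; random stand-in data).
theta = np.random.default_rng(0).normal(size=(n_respondents, 5))
print(np.corrcoef(theta, rowvar=False).round(2))
```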

Similar Papers
  • Research Article
  • 10.1001/jamaoto.2025.2691
Patient-Centered Approach to Assessing Voice Impairment
  • Sep 11, 2025
  • JAMA Otolaryngology–Head & Neck Surgery
  • Elizabeth G Willard + 5 more

The Voice Handicap Index-10 (VHI-10) is an established instrument with clear utility. However, national agencies are emphasizing the importance of patient-centered assessments beyond diagnostic test results, and how patients view the VHI-10 and its items is not known. The objective was to understand patients' perceptions of the VHI-10 items and to identify a potential subset of 6 items for use as a shorter patient-centered assessment. This was a prospective psychometric and patient-centered study conducted at tertiary care and community-based laryngology practices with consecutive adult patients who presented for laryngology evaluation from January 1, 2023, to December 31, 2024. Consecutive responses to the VHI-10 questionnaire were evaluated using factor and item response theory (IRT) analyses. Participants ranked VHI-10 items and provided qualitative feedback, which was inductively coded. Participants were asked what is "most important to you and your voice experience" when evaluating 3 proposed shorter subsets of the VHI-10. The main measures were the VHI-10 questionnaire and 3 subsets of 6 items each, including item ranking; factor analysis and IRT were used to produce the 3 subsets for quantitative and qualitative assessment by participants. The analysis included data from 6048 consecutive patients (mean [SD] age, 52.0 [8.4] years; 3326 female [55%] and 2722 male [45%] individuals) with completed VHI-10 questionnaires that were evaluated via factor analysis and IRT assessment. In addition, 461 consecutive patients prioritized the VHI-10 items and 521 rated each of the 3 potential subsets. Factor analysis confirmed unidimensionality, and IRT analysis demonstrated that items 4, 3, 6, and 1 had the highest discrimination parameters, while items 6, 7, and 1 were most frequently ranked as most or more important; item 5 was included in all sets because of prior clinician and patient input on its importance. Of the 3 subsets proposed, the patients favored set 1, which was composed of these items from the VHI-10: (1) my voice makes it difficult for people to hear me; (2) people have difficulty understanding me in a noisy room; (3) my voice difficulties restrict personal and social life; (6) I feel as though I have to strain to produce voice; and (7) the clarity of my voice is unpredictable; plus item (5), my voice problem causes me to lose income. This psychometric study identified a shorter version of the VHI-10 that may be more patient-centered and clinically sufficient for assessing patients with voice impairments. These findings may form the foundation for additional assessments that are more patient-centered, efficient, and nuanced.
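
As an illustration of the item-screening logic described above, here is a minimal sketch with made-up discrimination values (not the study's estimates): it ranks items by 2PL discrimination and force-includes item 5, mirroring the study's decision to retain that item on clinical grounds.

```python
# Hypothetical 2PL discrimination estimates for the ten VHI-10 items.
discrimination = {1: 1.9, 2: 1.4, 3: 2.1, 4: 2.3, 5: 0.9,
                  6: 2.0, 7: 1.5, 8: 1.2, 9: 1.1, 10: 1.0}

# Keep the most discriminating items, then force-include item 5.
top = sorted(discrimination, key=discrimination.get, reverse=True)[:5]
subset = sorted(set(top) | {5})
print(subset)
```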

  • Research Article
  • 10.59863/optz4045
Latent Variable Estimation in Factor Analysis and Item Response Theory
  • Dec 1, 2022
  • Chinese/English Journal of Educational Measurement and Evaluation
  • David Thissen

This essay sketches the historical development of latent variable scoring procedures in the item response theory (IRT) and factor analysis literatures, observing that the most commonly used score estimates in both traditions are fundamentally the same; only the methods of calculation differ. Different procedures have been used to derive factor score estimates and latent variable estimates in IRT, and different computational procedures have resulted. Owing to differences in the context of score usage, challenges have led to different solutions in the IRT and factor analytic traditions. The needs for bias corrections differ, as do the corrections that have been proposed. While the standard factor analysis model has naturally Gaussian likelihoods, IRT does not, but in IRT normal approximations have been used in various contexts to make the IRT computations more like those of factor analysis. Finally, factor analysis alone has been the home of decades of controversy over factor score indeterminacy, while IRT has not, even though the scores in question are the same. That is an artifact of history and of the ways the models have been written in the IRT and factor analytic literatures. That IRT has never been plagued with questions of indeterminacy helps to clarify the position that what is referred to as indeterminacy is not a problem.
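
The score estimate common to both traditions is the expected a posteriori (EAP) latent score. A minimal sketch, assuming a 2PL IRT model with hypothetical item parameters and a standard-normal prior, computes it by simple quadrature:

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_score(responses, a, b, n_quad=61):
    """Expected a posteriori latent score under a standard-normal prior,
    computed by quadrature over a grid of theta values."""
    theta = np.linspace(-4, 4, n_quad)
    prior = np.exp(-0.5 * theta**2)                   # unnormalized N(0,1)
    p = p2pl(theta[:, None], a[None, :], b[None, :])  # grid x items
    like = np.prod(np.where(responses, p, 1 - p), axis=1)
    post = like * prior
    return np.sum(theta * post) / np.sum(post)

a = np.array([1.2, 0.8, 1.5, 1.0])   # hypothetical discriminations
b = np.array([-0.5, 0.0, 0.4, 1.0])  # hypothetical difficulties
print(eap_score(np.array([1, 1, 0, 0]), a, b))
```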

  • Research Article
  • Cited by 1
  • 10.1007/s11336-021-09806-w
Assessing the Accuracy of Errors of Measurement: Implications for Assessing Reliable Change in Clinical Settings
  • Sep 1, 2021
  • Psychometrika
  • Alberto Maydeu-Olivares

Item response theory (IRT) models are non-linear latent variable models for discrete measures, whereas factor analysis (FA) is a latent variable model for continuous measures. In FA, the standard error (SE) of individuals' scores is common to all individuals. In IRT, the SE depends on the individual's score, and an SE function must be provided. The empirical standard deviation of the scores across discrete ranges should also be computed to inform the extent to which IRT SEs overestimate or underestimate the variability of the scores. Within the target range of scores the test was designed to measure, one should expect IRT SEs to be smaller and more precise than FA SEs, and therefore preferable for assessing clinical change. Outside the target range, IRT SEs may be too large and more imprecise than FA SEs, making FA the more precise choice for assessing change. As a result, whether FA or IRT characterizes reliable change more accurately in a sample will depend on the proportion of individuals within or outside the IRT target score range. An application is provided to illustrate these concepts.
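
The score-dependent SE the abstract refers to comes from the test information function: for a 2PL model, I(theta) = sum of a_i^2 * P_i(theta) * (1 - P_i(theta)) and SE(theta) = 1 / sqrt(I(theta)). A minimal sketch with hypothetical item parameters shows the SE growing outside the test's target range:

```python
import numpy as np

def test_information(theta, a, b):
    """2PL test information: I(theta) = sum_i a_i^2 * P_i * (1 - P_i)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(a**2 * p * (1 - p))

a = np.array([1.5, 1.2, 0.9, 1.8])   # hypothetical discriminations
b = np.array([-1.0, 0.0, 0.5, 1.2])  # hypothetical difficulties
for theta in (-3.0, 0.0, 3.0):
    se = 1.0 / np.sqrt(test_information(theta, a, b))
    print(f"theta={theta:+.1f}  SE={se:.2f}")
```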

  • Conference Article
  • Cited by 6
  • 10.1109/fie.2016.7757427
A factor analysis of Statics Concept Inventory data from practicing civil engineers
  • Oct 1, 2016
  • Oai Ha + 1 more

This study reports a factor analysis of Statics Concept Inventory (SCI) data collected from 95 practicing civil engineers in the Pacific Northwest. In comparison to students' responses reported in previous studies, the analysis of the engineers' data yielded a different number of underlying latent traits and different loading patterns of the SCI items on each trait. The study revealed that engineers' responses to the SCI might reflect the conceptual coherence associated with knowledge of engineering practice. The engineers' combination of discrete concepts into broader, meaningful concepts in this study may provide evidence of experts' characteristic ways of processing, organizing, and storing knowledge in chunks. Understanding experts' knowledge structure would help inform the development of curricular materials and assessment instruments for undergraduate engineering education.

  • Research Article
  • Cited by 30
  • 10.1080/09638280500404263
Validation of the NHANES ADL scale in a sample of patients with report of cervical pain: Factor analysis, item response theory analysis, and line item validity
  • Jan 1, 2006
  • Disability and Rehabilitation
  • Chad E Cook + 5 more

Background. Few functional outcome scales have used Item Response Theory (IRT) for validation. IRT allows individual line-item validation and offers substantial advantages over classic methods of scale validation or the simplest form of IRT, known as Rasch analysis. Rasch analysis reduces data to dichotomous variables, thus decreasing the sensitivity of Likert-type data responses. Purpose. The purpose of this study was to create an outcome scale associated with the latent trait of functioning and disability, validated using IRT, in a population with report of cervical pain. Methods. Using the NHANES database, a recently created scale (the NHANES ADL scale) was analysed using factor analysis, internal consistency analyses, IRT, and comparison with internal measures of functioning and disability. Results. The newly created NHANES ADL scale demonstrated unidimensionality, was internally reliable, and was correlated with internal measures of functioning and disability. Additionally, the majority of the scale items demonstrated strong discrimination and suitable thresholds. Discussion. The NHANES ADL scale effectively measures physical, social, and emotional disability in patients with a cervical impairment, and may be an efficient measure of perceived limitations on work and generalized daily physical activity. Conclusion. The newly created NHANES ADL scale demonstrates internal consistency, unidimensionality, and line-item validity. The NHANES ADL scale appears to be a useful instrument for measuring functioning and disability in patients with report of cervical pain.

  • Research Article
  • Cited by 4
  • 10.1080/00952990.2021.2012185
Cocaine use disorder criteria in a clinical sample: an analysis using item response theory, factor and network analysis
  • Jan 19, 2022
  • The American Journal of Drug and Alcohol Abuse
  • M Sanchez-Garcia + 4 more

Background. The conceptualization of substance use disorders (SUDs) was modified in successive editions of the DSM, and the dimensionality and inclusion/exclusion of several criteria were studied using various analytic approaches. Objective. The study aimed to deepen our knowledge of the interrelationships between the diagnostic criteria for cocaine use disorder (CUD) by applying three different analytical techniques: factor analysis, Item Response Theory (IRT) models, and network analysis. Methods. 425 (85.4% male) outpatients were evaluated for CUD using the Substance Dependence Severity Scale. Confirmatory factor analysis, a 2-parameter logistic model (IRT), and network analysis were applied to analyze the relationships between the diagnostic criteria. Results. The results show that the “legal problems” criterion is not congruent with the CUD measure in all three analyses. Network analysis also suggests the usefulness of the “craving” criterion. The “quit/control” criterion presents the best centrality indices and expected influence, showing strong relationships with the “craving,” “tolerance,” “neglect roles,” and “activities given up” criteria. Conclusions. Network analysis appears to be a useful and complementary technique to factor analysis and IRT for understanding CUD. The “quit/control” criterion emerges as a central criterion for understanding CUD.
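
The centrality measure mentioned in the results, one-step expected influence, is simply the sum of a node's signed edge weights. A minimal sketch with a made-up edge-weight matrix (not the study's estimated network) illustrates the idea:

```python
import numpy as np

# Hypothetical signed edge-weight matrix among five CUD criteria
# (symmetric, zero diagonal), as a network model might produce.
criteria = ["quit/control", "craving", "tolerance", "neglect roles", "legal problems"]
W = np.array([
    [0.00, 0.35, 0.25, 0.30, 0.02],
    [0.35, 0.00, 0.20, 0.15, 0.01],
    [0.25, 0.20, 0.00, 0.10, 0.00],
    [0.30, 0.15, 0.10, 0.00, 0.03],
    [0.02, 0.01, 0.00, 0.03, 0.00],
])

# One-step expected influence: the sum of each node's signed edge weights.
for name, ei in zip(criteria, W.sum(axis=1)):
    print(f"{name:15s} {ei:.2f}")
```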

  • Research Article
  • 10.1093/eurpub/ckab164.646
Developing a Frailty Index and investigating its psychometric properties using Item Response Theory
  • Oct 20, 2021
  • European Journal of Public Health
  • N Kleinenberg-Talsma + 4 more

Background. The proportion of frail older adults is increasing and is expected to increase further in the coming years, both globally and in the Dutch population. This poses a great challenge to public health. To determine the prevalence of frailty in a population, a frailty index (FI) is recommended. An FI is an accumulation model encompassing health deficits in multiple domains. Previous research has shown that an FI can be created out of existing health surveys, since it is a flexible instrument, fairly insensitive to the use of specific items. However, this finding is based on scale development using Classical Test Theory; few studies have investigated the psychometric properties of their FI using Item Response Theory (IRT). The aim of this study was to create an FI using the Dutch Health Monitor 2016 and to investigate its psychometric properties using IRT. Methods. Forty-two deficits were selected in three health domains, i.e., physical, psychological, and social. Psychometric properties were investigated by using an IRT model for polytomous response categories: the Graded Response Model (GRM). Items were evaluated by Cronbach's alpha, factor analysis, point-polyserial correlations, and the GRM. Results. The analyses showed that all items demonstrated a positive association with the scale. However, five items did not fit the FI scale well. From the physical domain these were body mass index and three items about adherence to physical activity guidelines: moderate activity per week; bone- and muscle-strengthening activities; balance exercises. From the psychological domain this was an item about a sense of control over one's own future. Conclusions. By using IRT, we showed that while 37 items were adequate and fitted the scale well, five items in our FI were redundant, indicating that it does matter which items are selected for an FI. IRT is a strong method for item selection and thus for creating a more concise Frailty Index. Key messages. Creating a solid and more concise Frailty Index with IRT is promising for epidemiological research and public health. For creating a Frailty Index, item selection needs careful consideration.
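
The accumulation model behind an FI is simple: each deficit is scored in [0, 1] and the index is the mean across deficits. A minimal sketch, using simulated deficit scores rather than Health Monitor data:

```python
import numpy as np

def frailty_index(deficits):
    """Accumulation-of-deficits frailty index: each deficit is scored
    in [0, 1] (0 = absent, 1 = fully present); the FI is their mean."""
    return np.asarray(deficits, dtype=float).mean()

# Hypothetical respondent with 42 deficits, most absent.
rng = np.random.default_rng(1)
scores = rng.choice([0.0, 0.5, 1.0], size=42, p=[0.7, 0.2, 0.1])
print(f"FI = {frailty_index(scores):.2f}")
```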

  • Research Article
  • Cited by 3
  • 10.1044/2023_jslhr-22-00458
Item Response Theory Modeling of the Verb Naming Test.
  • Mar 31, 2023
  • Journal of Speech, Language, and Hearing Research
  • Gerasimos Fergadiotis + 7 more

Item response theory (IRT) is a modern psychometric framework with several advantageous properties as compared with classical test theory. IRT has been successfully used to model performance on anomia tests in individuals with aphasia; however, all efforts to date have focused on noun production accuracy. The purpose of this study is to evaluate whether the Verb Naming Test (VNT), a prominent test of action naming, can be successfully modeled under IRT and evaluate its reliability. We used responses on the VNT from 107 individuals with chronic aphasia from AphasiaBank. Unidimensionality and local independence, two assumptions prerequisite to IRT modeling, were evaluated using factor analysis and Yen's Q3 statistic (Yen, 1984), respectively. The assumption of equal discrimination among test items was evaluated statistically via nested model comparisons and practically by using correlations of resulting IRT-derived scores. Finally, internal consistency, marginal and empirical reliability, and conditional reliability were evaluated. The VNT was found to be sufficiently unidimensional with the majority of item pairs demonstrating adequate local independence. An IRT model in which item discriminations are constrained to be equal demonstrated fit equivalent to a model in which unique discrimination parameters were estimated for each item. All forms of reliability were strong across the majority of IRT ability estimates. Modeling the VNT using IRT is feasible, yielding ability estimates that are both informative and reliable. Future efforts are needed to quantify the validity of the VNT under IRT and determine the extent to which it measures the same construct as other anomia tests. https://doi.org/10.23641/asha.22329235.
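
Yen's Q3 checks local independence by correlating model residuals between item pairs: values near zero suggest the assumption holds. A minimal sketch, assuming a fitted model has already produced expected response probabilities (simulated here, not the VNT fits):

```python
import numpy as np

def q3_matrix(responses, p_expected):
    """Yen's Q3: correlations between model residuals for item pairs.
    responses and p_expected are persons x items arrays."""
    residuals = responses - p_expected
    return np.corrcoef(residuals, rowvar=False)

rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.9, size=(107, 6))           # stand-in model probabilities
x = (rng.uniform(size=p.shape) < p).astype(float)  # simulated responses
print(np.round(q3_matrix(x, p), 2))
```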

  • Conference Article
  • 10.1063/1.4992683
Item response theory – A first approach
  • Jan 1, 2017
  • Sandra Nunes + 2 more

The Item Response Theory (IRT) has become one of the most popular scoring frameworks for measurement data, frequently used in computerized adaptive testing, cognitively diagnostic assessment, and test equating. According to Andrade et al. (2000), IRT can be defined as a set of mathematical models (Item Response Models, IRM) constructed to represent the probability of an individual giving the right answer to an item of a particular test. The number of Item Response Models available for measurement analysis has increased considerably in the last fifteen years, due to increasing computer power and a demand for accurate and more meaningful inferences grounded in complex data. The developments in modeling with Item Response Theory were related to developments in estimation theory, most remarkably Bayesian estimation with Markov chain Monte Carlo algorithms (Patz & Junker, 1999). The popularity of Item Response Theory has also prompted numerous overviews in books and journals, and many connections between IRT and other statistical estimation procedures, such as factor analysis and structural equation modeling, have been made repeatedly (van der Linden & Hambleton, 1997). As stated before, Item Response Theory covers a variety of measurement models, ranging from basic one-dimensional models for dichotomously and polytomously scored items and their multidimensional analogues to models that incorporate information about cognitive sub-processes which influence the overall item response process. The aim of this work is to introduce the main concepts associated with one-dimensional models of Item Response Theory, to specify the logistic models with one, two, and three parameters, to discuss some properties of these models, and to present the main estimation procedures.
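
The one-, two-, and three-parameter logistic models mentioned here nest inside a single formula, P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))). A minimal sketch with hypothetical parameter values:

```python
import numpy as np

def p3pl(theta, a=1.0, b=0.0, c=0.0):
    """3PL model: P(correct) = c + (1 - c) / (1 + exp(-a(theta - b))).
    Setting c=0 gives the 2PL; additionally fixing a=1 gives the 1PL."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])
print(p3pl(theta))                       # 1PL (a=1, b=0, c=0)
print(p3pl(theta, a=1.7, b=0.5))         # 2PL with discrimination and difficulty
print(p3pl(theta, a=1.7, b=0.5, c=0.2))  # 3PL with a guessing floor
```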

  • Research Article
  • Cited by 13
  • 10.1007/s11136-014-0643-6
The assessment of publication pressure in medical science; validity and reliability of a Publication Pressure Questionnaire (PPQ)
  • Feb 13, 2014
  • Quality of Life Research
  • J K Tijdink + 4 more

To determine content validity, structural validity, construct validity and reliability of an internet-based questionnaire designed for assessment of publication pressure experienced by medical scientists. The Publication Pressure Questionnaire (PPQ) was designed to assess psychological pressure to publish scientific papers. Content validity was evaluated by collecting independent comments from external experts (n=7) on the construct, comprehensiveness and relevance of the PPQ. Structural validity was assessed by factor analysis and item response theory (IRT) using the generalized partial credit model. Pearson's correlation coefficients were calculated to assess potential correlations with the emotional exhaustion and depersonalization subscales of the Maslach Burnout Inventory (MBI). Single test reliability (lambda2) was obtained from the IRT analysis. Content validity was satisfactory. Confirmatory factor analysis did not support the presence of three initially assumed separate domains of publication pressure (i.e., personally experienced publication pressure, publication pressure in general, pressure on position of scientist). After exclusion of the third domain (six items), we performed exploratory factor analysis and IRT. The goodness-of-fit statistics for the IRT assuming a single dimension were satisfactory when four items were removed, resulting in 14 items of the final PPQ. Correlations with the emotional exhaustion and depersonalization scales of the MBI were 0.34 and 0.31, respectively, supporting construct validity. Single test administration reliability lambda2 was 0.69 and 0.90 on the test scores and expected a posteriori scores, respectively. The PPQ seems a valid and reliable instrument to measure publication pressure among medical scientists.
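
The lambda2 reliability reported here is Guttman's lambda-2, which can be computed directly from the item covariance matrix. A minimal sketch on simulated item scores (not the PPQ data):

```python
import numpy as np

def guttman_lambda2(items):
    """Guttman's lambda-2 reliability from a persons x items score matrix."""
    C = np.cov(items, rowvar=False)
    n = C.shape[0]
    v_total = C.sum()                      # variance of the sum score
    off_sq = (C**2).sum() - (np.diag(C)**2).sum()
    lam1 = 1.0 - np.trace(C) / v_total
    return lam1 + np.sqrt(n / (n - 1) * off_sq) / v_total

rng = np.random.default_rng(2)
common = rng.normal(size=(200, 1))         # shared latent trait
items = common + rng.normal(size=(200, 14))  # 14 items, like the final PPQ
print(f"lambda2 = {guttman_lambda2(items):.2f}")
```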

  • Research Article
  • Cited by 91
  • 10.2196/jmir.6749
A Psychometric Analysis of the Italian Version of the eHealth Literacy Scale Using Item Response and Classical Test Theory Methods
  • Apr 11, 2017
  • Journal of Medical Internet Research
  • Nicola Diviani + 2 more

Background. The eHealth Literacy Scale (eHEALS) is a tool to assess consumers' comfort and skills in using information technologies for health. Although evidence exists of reliability and construct validity of the scale, less agreement exists on structural validity. Objective. The aim of this study was to validate the Italian version of the eHealth Literacy Scale (I-eHEALS) in a community sample with a focus on its structural validity, by applying psychometric techniques that account for item difficulty. Methods. Two Web-based surveys were conducted among a total of 296 people living in the Italian-speaking region of Switzerland (Ticino). After examining the latent variables underlying the observed variables of the Italian scale via principal component analysis (PCA), fit indices for two alternative models were calculated using confirmatory factor analysis (CFA). The scale structure was examined via parametric and nonparametric item response theory (IRT) analyses accounting for differences between items regarding the proportion of answers indicating high ability. Convergent validity was assessed by correlations with theoretically related constructs. Results. CFA showed a suboptimal model fit for both models. IRT analyses confirmed all items measure a single dimension as intended. Reliability and construct validity of the final scale were also confirmed. The contrasting results of factor analysis (FA) and IRT analyses highlight the importance of considering differences in item difficulty when examining health literacy scales. Conclusions. The findings support the reliability and validity of the translated scale and its use for assessing Italian-speaking consumers' eHealth literacy.

  • Book Chapter
  • 10.1002/9781118521373.wbeaa320
Item Response Theory
  • Dec 20, 2015
  • Jonathan Templin

Item response theory (IRT) is the name for a collection of psychometric methods that are used for the analysis of test, questionnaire, and survey data with categorical or discrete item responses. In this entry, a brief overview of the logic that underlies IRT is provided. The entry begins by linking IRT with factor analysis, a psychometric method that is currently more prevalent in aging research. Following that, the basics of unidimensional IRT models are detailed, culminating with a short discussion of topics typically discussed in IRT research, such as equating and multidimensional IRT.

  • Research Article
  • Cited by 66
  • 10.1037/pas0000597
Advances in applications of item response theory to clinical assessment.
  • Dec 1, 2019
  • Psychological Assessment
  • Michael L Thomas

Item response theory (IRT) is moving to the forefront of methodologies used to develop, evaluate, and score clinical measures. Funding agencies and test developers are routinely supporting IRT work, and the theory has become closely tied to technological advances within the field. As a result, familiarity with IRT has grown increasingly relevant to mental health research and practice. But to what end? This article reviews advances in applications of IRT to clinical measurement in an effort to identify tangible improvements that can be attributed to the methodology. Although IRT shares similarities with classical test theory and factor analysis, the approach has certain practical benefits, but also limitations, when applied to measurement challenges. Major opportunities include the use of computerized adaptive tests to prevent conditional measurement error, multidimensional models to prevent misinterpretation of scores, and analyses of differential item functioning to prevent bias. Whereas these methods and technologies were once only discussed as future possibilities, they are now accessible because of recent support of IRT-focused clinical research. Despite this, much work still remains in widely disseminating methods and technologies from IRT into mental health research and practice. Clinicians have been reluctant to fully embrace the approach, especially in terms of prospective test development and adaptive item administration. Widespread use of IRT technologies will require continued cooperation among psychometricians, clinicians, and other stakeholders. There are also many opportunities to expand the methodology, especially with respect to integrating modern measurement theory with models from personality and cognitive psychology as well as neuroscience.
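
One of the IRT technologies highlighted here, computerized adaptive testing, typically selects the next item by maximizing Fisher information at the examinee's current ability estimate. A minimal sketch, assuming a 2PL item bank with made-up parameters:

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of each 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def next_item(theta_hat, a, b, administered):
    """Maximum-information item selection, the usual CAT rule."""
    info = item_information(theta_hat, a, b)
    info[list(administered)] = -np.inf   # do not repeat items
    return int(np.argmax(info))

a = np.array([1.8, 1.2, 0.9, 1.5, 2.0])   # hypothetical discriminations
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.1]) # hypothetical difficulties
print(next_item(0.2, a, b, administered={0}))
```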

  • Dissertation
  • 10.25394/pgs.9108656.v1
Affective Engagement in Information Visualization
  • Aug 13, 2019
  • Ya‐Hsin Hung

Evaluating the “success” of an information visualization (InfoVis) whose main purpose is communication or presentation is challenging. Among metrics that go beyond traditional analysis- and performance-oriented approaches, one construct that has received attention in recent years is “user engagement”. In this research, I propose Affective Engagement (AE), a user's engagement with the emotional aspects of a visualization, as a metric for InfoVis evaluation. I developed and evaluated a self-report measurement tool named AEVis that can quantify a user's level of AE while using an InfoVis. Following a systematic process of evidence-centered design, each activity during instrument development contributed specific evidence to support the validity of interpretations of scores from the instrument. Four stages were established for the development. In stage 1, I examined the role and characteristics of AE in evaluating information visualization through an exploratory qualitative study, from which 11 indicators of AE were proposed: Fluidity, Enthusiasm, Curiosity, Discovery, Clarity, Storytelling, Creativity, Entertainment, Untroubling, Captivation, and Pleasing. In stage 2, I developed an item bank comprising various candidate items for assessing a user's level of AE, and assembled the first version of the survey instrument through target population and domain experts' feedback. In stage 3, I conducted three field tests for instrument revisions; three analytical methods were applied during this process: Item Analysis, Factor Analysis (FA), and Item Response Theory (IRT). In stage 4, a follow-up field test study was conducted to investigate the external relations between constructs in AEVis and other existing instruments. The results of the four stages support the validity and reliability of the developed instrument: in stage 1, users' AE characteristics elicited from the observations support the theoretical background of the test content; in stage 2, the feedback and review from target users and domain experts provide validity evidence for the test content of the instrument in the context of InfoVis; in stage 3, results from exploratory and confirmatory FA, as well as IRT methods, reveal evidence for the internal structure of the instrument; in stage 4, the correlations between total scores and sub-scores of AEVis and other existing instruments provide external-relation evidence for score interpretations. Using this instrument, visualization researchers and designers can evaluate non-performance-related aspects of their work efficiently and without specific domain knowledge. The utilities and implications of AE can be investigated as well. In the future, this research may provide a foundation for expanding the theoretical basis of engagement in the fields of human-computer interaction and information visualization.

  • Research Article
  • Cited by 2
  • 10.3389/fpsyg.2023.1267219
Interchangeability between factor analysis, logistic IRT, and normal ogive IRT.
  • Sep 25, 2023
  • Frontiers in Psychology
  • Eunseong Cho

In existing studies, it has been argued that factor analysis (FA) is equivalent to item response theory (IRT) and that IRT models that use different functions (i.e., logistic and normal ogive) are also interchangeable. However, these arguments have weak links. The proof of equivalence between FA and normal ogive IRT assumes a normal distribution. The interchangeability between the logistic and normal ogive IRT models depends on a scaling constant, but few scholars have examined whether the usual values of 1.7 or 1.702 maximize interchangeability. This study addresses these issues through Monte Carlo simulations. First, the FA model produces almost identical results to those of the normal ogive model even under severe nonnormality. Second, no single scaling constant maximizes the interchangeability between logistic and normal ogive models. Instead, users should choose different scaling constants depending on their purpose in using a model and the number of response categories (i.e., dichotomous or polytomous). Third, the interchangeability between logistic and normal ogive models is determined by several conditions: it is high if the data are dichotomous or if the latent variables follow a symmetric distribution, and low otherwise. In summary, the interchangeability between FA and normal ogive models is greater than expected, but that between logistic and normal ogive models is not.
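
The scaling constant in question makes the logistic curve approximate the normal ogive, Phi(x) ≈ 1 / (1 + exp(-Dx)). A minimal sketch that measures the maximum gap between the two curves for a few candidate values of D (the specific values tried here are illustrative):

```python
import numpy as np
from math import erf, sqrt

def normal_ogive(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def logistic(x, D):
    """Logistic curve with scaling constant D."""
    return 1.0 / (1.0 + np.exp(-D * x))

x = np.linspace(-4, 4, 2001)
phi = np.array([normal_ogive(v) for v in x])
for D in (1.6, 1.7, 1.702, 1.8):
    gap = np.max(np.abs(logistic(x, D) - phi))
    print(f"D={D:<6} max |logistic - ogive| = {gap:.4f}")
```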
