Latent Variable Estimation in Factor Analysis and Item Response Theory
This essay sketches the historical development of latent variable scoring procedures in the item response theory (IRT) and factor analysis literatures, observing that the most commonly used score estimates in the two traditions are fundamentally the same; only the methods of calculation differ. Different procedures have been used to derive factor score estimates and IRT latent variable estimates, and different computational procedures have resulted. Because the scores are used in different contexts, the same challenges have led to different solutions in the IRT and factor analytic traditions: the needs for bias corrections differ, as do the corrections that have been proposed. The standard factor analysis model has naturally Gaussian likelihoods, while IRT does not; in IRT, normal approximations have been used in various contexts to make the computations more like those of factor analysis. Finally, factor analysis alone has been the home of decades of controversy over factor score indeterminacy, even though the scores in question are the same; that is an artifact of history and of the ways the models have been written in the IRT and factor analytic literatures. That IRT has never been troubled by questions of indeterminacy helps to clarify the position that what is referred to as indeterminacy is not a problem.
- Book Chapter
2
- 10.1007/978-3-030-43469-4_14
- Jan 1, 2020
The factor scores of confirmatory factor analysis (CFA) models and the latent variables of item response theory (IRT) models are similar statistical entities, so one would expect their estimation or characterization to follow parallel tracks in CFA and IRT. Historically, however, they have not. Different procedures have been used to derive factor score estimates and latent variable estimates in IRT, and different computational procedures have resulted. In this chapter we approach factor score estimation for some simple CFA models from the perspective of IRT, with the kinds of graphics that are used to explain IRT estimates of proficiency and with the computational procedures that are used in test theory. We compare traditional “regression” and “Bartlett” factor score estimates with alternative computational approaches to likelihood-based factor score estimates, referring to the expected a posteriori and maximum likelihood estimates of IRT latent variables to clarify relations among the scores. This provides insight into the ways in which the data are combined into factor score estimates. The results also provide an alternative method to compute factor scores in some simple models when observations are missing at random for some variables.
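The estimates compared in this chapter can be illustrated with a few lines of linear algebra. Below is a minimal sketch, assuming a standardized one-factor model with hypothetical loadings and a made-up response vector (none of this comes from the chapter's own examples): the “regression” score coincides with the Gaussian-model EAP computed by brute force, the “Bartlett” score is the ML-type estimate, and scoring under data missing at random amounts to dropping the missing variables' rows.

```python
# A minimal sketch, not taken from the chapter, of one-factor "regression" and
# "Bartlett" factor scores, with the regression score checked against a
# brute-force expected a posteriori (EAP) value computed on a latent-variable
# grid.  Loadings, uniquenesses, and the response vector are illustrative values.
import numpy as np
from scipy.stats import norm

lam = np.array([0.8, 0.7, 0.6, 0.5])    # hypothetical factor loadings (one factor)
psi = 1.0 - lam**2                      # unique variances (standardized items)
y = np.array([1.2, 0.4, -0.3, 0.9])     # one person's observed (centered) scores

def regression_score(lam, psi, y):
    # posterior mean under the Gaussian model: lam' Sigma^{-1} y, Sigma = lam lam' + Psi
    Sigma = np.outer(lam, lam) + np.diag(psi)
    return lam @ np.linalg.solve(Sigma, y)

def bartlett_score(lam, psi, y):
    # ML-type estimate of the factor: (lam' Psi^{-1} lam)^{-1} lam' Psi^{-1} y
    w = lam / psi
    return (w @ y) / (w @ lam)

def eap_score(lam, psi, y):
    # brute-force posterior mean: N(0, 1) prior times the Gaussian likelihood on a grid
    grid = np.linspace(-6, 6, 2001)
    loglik = sum(norm.logpdf(yi, loc=l * grid, scale=np.sqrt(p))
                 for l, p, yi in zip(lam, psi, y))
    post = np.exp(loglik) * norm.pdf(grid)
    return np.sum(grid * post) / np.sum(post)

print(regression_score(lam, psi, y), eap_score(lam, psi, y))  # these two agree
print(bartlett_score(lam, psi, y))                            # the "ML"-type score

# Missing-at-random handling: drop the missing variables' rows and rescore
obs = np.array([True, False, True, True])
print(regression_score(lam[obs], psi[obs], y[obs]))
```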
- Research Article
4
- 10.1080/00952990.2021.2012185
- Jan 19, 2022
- The American Journal of Drug and Alcohol Abuse
Background The conceptualization of substance use disorders (SUDs) was modified in successive editions of the DSM, and the dimensionality and inclusion/exclusion of several criteria have been studied using various analytic approaches. Objective The study aimed to deepen our knowledge of the interrelationships between the diagnostic criteria for cocaine use disorder (CUD) by applying three different analytical techniques: factor analysis, Item Response Theory (IRT) models, and network analysis. Methods 425 (85.4% male) outpatients were evaluated for CUD using the Substance Dependence Severity Scale. Confirmatory factor analysis, a 2-parameter logistic IRT model, and network analysis were applied to analyze the relationships between the diagnostic criteria. Results The results show that the “legal problems” criterion is not congruent with the CUD measure in any of the three analyses. Network analysis also suggests the usefulness of the “craving” criterion. The “quit/control” criterion presents the strongest centrality indices and expected influence, showing strong relationships with the “craving,” “tolerance,” “neglect roles,” and “activities given up” criteria. Conclusions Network analysis appears to be a useful technique, complementary to factor analysis and IRT, for understanding CUD. The “quit/control” criterion emerges as a central criterion for understanding CUD.
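For readers unfamiliar with the network indices mentioned above, the sketch below shows how strength centrality and expected influence are computed from a weighted symptom network. The criterion names are taken from the abstract, but the edge weights are made-up illustrative values, not the study's estimated network.

```python
# A minimal sketch (with made-up edge weights, not the paper's data) of two
# node-centrality indices used in symptom network analysis: "strength"
# (sum of absolute edge weights) and "expected influence" (sum of signed
# edge weights), computed from a symmetric weighted adjacency matrix.
import numpy as np

criteria = ["craving", "tolerance", "quit/control", "neglect roles"]
# hypothetical partial-correlation network (symmetric, zero diagonal)
W = np.array([
    [0.00, 0.25, 0.40, 0.10],
    [0.25, 0.00, 0.30, -0.05],
    [0.40, 0.30, 0.00, 0.35],
    [0.10, -0.05, 0.35, 0.00],
])

strength = np.abs(W).sum(axis=1)
expected_influence = W.sum(axis=1)

for name, s, ei in zip(criteria, strength, expected_influence):
    print(f"{name:15s} strength={s:.2f}  expected influence={ei:.2f}")
```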
- Research Article
- 10.1177/10497315251335256
- May 2, 2025
- Research on Social Work Practice
Purpose Excessive self-criticism, along with negative self-evaluation and perceived unfavorable judgments from others, often leads to emotional distress. The Levels of Self-Criticism (LOSC) scale identifies two distinct forms of self-criticism, comparative self-criticism and internalized self-criticism, yet shows varying psychometric stability across populations. Method This study developed a shortened, psychometrically robust version of the LOSC, employing item response theory (IRT) and factor analysis to enhance the practicality and reliability of the scale. Results 415 participants completed the baseline survey and 232 completed the post-test; the sample was 83% female with a mean age of 39.73 years. IRT analysis eliminated 11 items, with the remaining items demonstrating optimal item performance and significant concurrent validity with related measures. The shortened LOSC showed strong test–retest reliability and construct validity. Discussion This streamlined scale provides a precise tool for assessing self-criticism, contributing to better psychological practice and research.
- Research Article
- 10.1057/s41599-025-04927-4
- May 6, 2025
- Humanities and Social Sciences Communications
This study aimed to develop and validate a generic self-assessment scale for Chinese K-12 teachers to evaluate their feedback-giving literacy in classroom settings, using Item Response Theory (IRT), Exploratory Factor Analysis (EFA), and Confirmatory Factor Analysis (CFA). The scale was constructed based on a conceptual framework encompassing four components: knowledge, skills, values, and actionability. A pilot test with 1068 teachers led to the selection of 30 items, which were then validated with a sample of 980 teachers. EFA revealed a clear factor structure, explaining 65.42% of the total variance, while CFA confirmed a good model fit (CFI > 0.9, RMSEA < 0.08). The final scale demonstrated high internal consistency (McDonald’s Omega coefficient = 0.97) across all subscales. IRT analyses indicated strong measurement precision, particularly in the skills and actionability subscales. Although the study is limited to the Chinese K-12 context and relies on self-reported data, the findings offer a valuable tool for teachers to assess and improve their feedback practices. The scale can be used for professional development and further research on feedback-giving literacy. Future studies should explore its applicability in different cultural contexts and investigate the development of teacher feedback literacy over time.
- Abstract
5
- 10.1016/j.jval.2014.08.1909
- Oct 26, 2014
- Value in Health
PRM156 - Current Sample Size Practices in the Psychometric Evaluation of Patient-Reported Outcomes for Use in Clinical Trials
- Research Article
626
- 10.1007/bf02294363
- Sep 1, 1987
- Psychometrika
The equivalence of the marginal likelihood of the two-parameter normal ogive model in item response theory (IRT) and that of factor analysis of dichotomized variables (FA) was formally proved. The basic result for dichotomous variables was extended to multicategory cases, covering both ordered and unordered categorical data. Pair comparison data arising from multiple-judgment sampling were discussed as a special case of unordered categorical data. A taxonomy of data for the IRT and FA models was also attempted.
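The equivalence rests on a simple parameter correspondence that is easy to check numerically. The sketch below is a minimal illustration rather than the paper's proof: it converts arbitrary normal-ogive item parameters (a, b) into the loading/threshold parameterization of a dichotomized normal variable and verifies that both forms give the same item response function.

```python
# A minimal numerical illustration (not the paper's proof) of the parameter
# correspondence between the two-parameter normal ogive model and a one-factor
# model for a dichotomized normal variable: loading lambda = a / sqrt(1 + a^2),
# threshold tau = a * b / sqrt(1 + a^2).  The a, b values here are arbitrary.
import numpy as np
from scipy.stats import norm

a, b = 1.4, -0.3                      # IRT discrimination and difficulty
lam = a / np.sqrt(1 + a**2)           # factor loading of the latent response y*
tau = a * b / np.sqrt(1 + a**2)       # threshold that dichotomizes y*

theta = np.linspace(-3, 3, 7)
p_irt = norm.cdf(a * (theta - b))                            # normal ogive IRF
p_fa = norm.cdf((lam * theta - tau) / np.sqrt(1 - lam**2))   # FA/threshold form

print(np.allclose(p_irt, p_fa))       # True: the two parameterizations coincide
```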
- Research Article
92
- 10.2196/jmir.6749
- Apr 11, 2017
- Journal of Medical Internet Research
Background The eHealth Literacy Scale (eHEALS) is a tool to assess consumers’ comfort and skills in using information technologies for health. Although evidence exists for the reliability and construct validity of the scale, there is less agreement on its structural validity. Objective The aim of this study was to validate the Italian version of the eHealth Literacy Scale (I-eHEALS) in a community sample, with a focus on its structural validity, by applying psychometric techniques that account for item difficulty. Methods Two Web-based surveys were conducted among a total of 296 people living in the Italian-speaking region of Switzerland (Ticino). After examining the latent variables underlying the observed variables of the Italian scale via principal component analysis (PCA), fit indices for two alternative models were calculated using confirmatory factor analysis (CFA). The scale structure was examined via parametric and nonparametric item response theory (IRT) analyses, accounting for differences between items in the proportion of answers indicating high ability. Convergent validity was assessed by correlations with theoretically related constructs. Results CFA showed a suboptimal model fit for both models. IRT analyses confirmed that all items measure a single dimension, as intended. Reliability and construct validity of the final scale were also confirmed. The contrasting results of factor analysis (FA) and IRT analyses highlight the importance of considering differences in item difficulty when examining health literacy scales. Conclusions The findings support the reliability and validity of the translated scale and its use for assessing Italian-speaking consumers’ eHealth literacy.
- Research Article
22
- 10.1186/s12955-016-0444-4
- Mar 12, 2016
- Health and Quality of Life Outcomes
Background To examine the feasibility of performing an item response theory (IRT) analysis on two of the Centers for Disease Control and Prevention health-related quality of life (CDC HRQOL) modules: the 4-item Healthy Days Core Module (HDCM) and the 5-item Healthy Days Symptoms Module (HDSM). Previous principal components analyses confirm that the two scales both assess a mix of mental (CDC-MH) and physical (CDC-PH) health. The purpose was to conduct IRT analyses on the CDC-MH and CDC-PH scales separately. Methods 2182 patients with self-reported or physician-diagnosed arthritis completed a cross-sectional survey including the HDCM and HDSM items. Apart from global health, the other 8 items ask for the number of days on which some statement was true; we chose to recode the data into 8 categories based on observed clustering. The IRT assumptions were assessed using confirmatory factor analysis, and the data could be modeled with a unidimensional IRT model. The graded response model was used for the IRT analyses, and the CDC-MH and CDC-PH scales were analyzed separately in flexMIRT. Results The IRT parameter estimates for the five-item CDC-PH all appeared reasonable. The three-item CDC-MH did not have reasonable parameter estimates. Conclusions The CDC-PH scale is amenable to IRT analysis, but the existing CDC-MH scale is not. We suggest either using the 4-item HDCM and 5-item HDSM as they currently stand or using the CDC-PH scale alone if the primary goal is to measure physical-health-related HRQOL.
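As a reminder of how the graded response model used here assigns probabilities to ordered categories, the sketch below computes category probabilities from a single item's discrimination and ordered thresholds. The parameter values are hypothetical, not estimates from the CDC HRQOL data.

```python
# A minimal sketch of graded response model (GRM) category probabilities for a
# single polytomous item; the discrimination and ordered thresholds below are
# made-up values, not estimates from the CDC HRQOL analyses.
import numpy as np

def grm_probs(theta, a, b):
    """Category probabilities P(X = k | theta), k = 0..len(b)."""
    theta = np.atleast_1d(theta)[:, None]
    b = np.asarray(b)[None, :]
    # cumulative probabilities P(X >= k | theta) for k = 1..K-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    ones = np.ones((theta.shape[0], 1))
    zeros = np.zeros((theta.shape[0], 1))
    bounds = np.hstack([ones, cum, zeros])         # P(X>=0)=1, P(X>=K)=0
    return bounds[:, :-1] - bounds[:, 1:]          # adjacent differences

a = 1.8                                   # hypothetical discrimination
b = [-1.5, -0.5, 0.4, 1.2]                # hypothetical ordered thresholds (5 categories)
print(grm_probs([-1.0, 0.0, 1.0], a, b))  # each row sums to 1
```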
- Research Article
267
- 10.1017/s0033291707001730
- Oct 9, 2007
- Psychological Medicine
A number of scales are used to estimate the severity of depression. However, differences between self-report and clinician rating, multi-dimensionality and different weighting of individual symptoms in summed scores may affect the validity of measurement. In this study we examined and integrated the psychometric properties of three commonly used rating scales. The 17-item Hamilton Depression Rating Scale (HAMD-17), the Montgomery-Asberg Depression Rating Scale (MADRS) and the Beck Depression Inventory (BDI) were administered to 660 adult patients with unipolar depression in a multi-centre pharmacogenetic study. Item response theory (IRT) and factor analysis were used to evaluate their psychometric properties and estimate true depression severity, as well as to group items and derive factor scores. The MADRS and the BDI provide internally consistent but mutually distinct estimates of depression severity. The HAMD-17 is not internally consistent and contains several items less suitable for out-patients. Factor analyses indicated a dominant depression factor. A model comprising three dimensions, namely 'observed mood and anxiety', 'cognitive' and 'neurovegetative', provided a more detailed description of depression severity. The MADRS and the BDI can be recommended as complementary measures of depression severity. The three factor scores are proposed for external validation.
- Research Article
1
- 10.26713/jims.v11i1.951
- Mar 31, 2019
- Journal of Informatics and Mathematical Sciences
The study investigates the effects of item response scales on the results of item response theory (IRT) models and multivariate techniques. A total of sixty-four datasets were simulated under various conditions, such as item response format, the number of dimensions underlying the response scales, and sample size, using the R package mirt command simdata(a, d, N, itemtype). Two main statistical techniques are employed: IRT models and factor analysis. We find a direct relationship between IRT parameters and factor model parameters, in particular between item discrimination and factor loadings. The results also show that the overall fit of the item response model improves with more scale points when dimensionality is higher and the sample size is 150 or more; for small samples, fit under the unidimensional model deteriorates as scale points increase. The number of influential indicators per factor likewise increases with the number of scale points, which improves model fit. The study suggests that a five-point response scale gives the most reasonable results among the scales examined. IRT analysis is recommended as a preliminary step to ascertain the observed features of the items. The study also finds that a sample size of 150 is adequate for a plausible factor solution under various conditions.
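The reported relationship between item discrimination and factor loadings follows the standard conversion for slopes on the logistic metric. The sketch below illustrates that conversion and its inverse with arbitrary slope values; it is not a reproduction of the study's simulation, which used the R package mirt.

```python
# A minimal sketch of the commonly cited conversion between a unidimensional
# IRT slope on the logistic metric and a standardized factor loading,
# lambda = a / sqrt(a^2 + D^2) with D ~= 1.702.  The slope values are
# illustrative, not estimates from the study.
import numpy as np

D = 1.702                                    # logistic-to-probit scaling constant
a = np.array([0.8, 1.2, 1.7, 2.5])           # hypothetical logistic slopes
lam = a / np.sqrt(a**2 + D**2)               # implied standardized loadings
a_back = D * lam / np.sqrt(1 - lam**2)       # inverse mapping recovers the slopes

print(np.round(lam, 3))
print(np.allclose(a, a_back))                # True
```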
- Research Article
64
- 10.1080/10705511.2011.581993
- Jun 30, 2011
- Structural Equation Modeling: A Multidisciplinary Journal
Linear factor analysis (FA) models can be reliably tested using test statistics based on residual covariances. We show that the same statistics can be used, under some conditions, to reliably test the fit of item response theory (IRT) models for ordinal data. Hence, the fit of an FA model and of an IRT model to the same data set can now be compared. Our experience suggests that, when applied to binary data, IRT and FA models yield similar fits. However, when the data are polytomous ordinal, IRT models yield a better fit because they involve a larger number of parameters. But when fit is assessed using the root mean square error of approximation (RMSEA), similar fits are obtained again. We explain why. These test statistics have little power to distinguish between FA and IRT models; they are unable to detect that linear FA is misspecified when applied to ordinal data generated under an IRT model.
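The RMSEA result is easiest to see from the formula itself: one common form divides the excess of the test statistic over its degrees of freedom by df × (N − 1), so a model with more parameters (fewer df) can achieve a smaller statistic yet a similar RMSEA. The values in the sketch below are made up purely to illustrate this point, not taken from the article.

```python
# A minimal sketch of how RMSEA adjusts a chi-square-type fit statistic for
# model degrees of freedom and sample size, using one common form:
# RMSEA = sqrt(max((T - df) / (df * (N - 1)), 0)).
# The statistic values below are hypothetical, chosen to show why a model with
# fewer df can fit "better" by T yet similarly by RMSEA.
import math

def rmsea(T, df, N):
    return math.sqrt(max((T - df) / (df * (N - 1)), 0.0))

N = 1000
print(rmsea(T=220.0, df=170, N=N))   # e.g. a linear FA model with more df
print(rmsea(T=150.0, df=115, N=N))   # e.g. an IRT model with fewer df and smaller T
```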
- Research Article
- 10.1001/jamaoto.2025.2691
- Sep 11, 2025
- JAMA Otolaryngology–Head & Neck Surgery
The Voice Handicap Index-10 (VHI-10) is an established instrument with clear utility. However, national agencies are emphasizing the importance of patient-centered assessments beyond diagnostic test results, and how patients view the VHI-10 and its items is not known. The objectives were to understand patients' perceptions of the VHI-10 items and to identify a potential subset of 6 items for use as a shorter, patient-centered assessment. This was a prospective psychometric and patient-centered study conducted at tertiary care and community-based laryngology practices with consecutive adult patients who presented for laryngology evaluation from January 1, 2023, to December 31, 2024. Consecutive responses to the VHI-10 questionnaire were evaluated using factor and item response theory (IRT) analyses. Participants ranked the VHI-10 items and provided qualitative feedback, which was inductively coded. Participants were asked what was "most important to you and your voice experience" when evaluating 3 proposed shorter subsets of the VHI-10. The measures were the VHI-10 questionnaire and 3 subsets of 6 items each, including item ranking (evaluated by factor analysis and IRT). Factor analysis and IRT were used to produce 3 subsets of the VHI-10 for quantitative and qualitative assessment by participants. The analysis included data from 6048 consecutive patients (mean [SD] age, 52.0 [8.4] years; 3326 female [55%] and 2722 male [45%] individuals) with completed VHI-10 questionnaires that were evaluated via factor analysis and IRT. In addition, 461 consecutive patients prioritized the VHI-10 items and 521 rated each of the 3 potential subsets. Factor analysis confirmed unidimensionality, and IRT analysis demonstrated that items 4, 3, 6, and 1 had the highest discrimination parameters, while items 6, 7, and 1 were most frequently ranked as most or more important; item 5 was included in all sets because of prior clinician and patient input on its importance. Of the 3 subsets proposed, patients favored set 1, composed of these VHI-10 items: (1) my voice makes it difficult for people to hear me; (2) people have difficulty understanding me in a noisy room; (3) my voice difficulties restrict personal and social life; (6) I feel as though I have to strain to produce voice; and (7) the clarity of my voice is unpredictable; plus item (5), my voice problem causes me to lose income. This psychometric study identified a shorter version of the VHI-10 that may be more patient-centered and clinically sufficient for assessing patients with voice impairments. These findings may form the foundation for additional assessments that are more patient-centered, efficient, and nuanced.
- Research Article
23
- 10.1186/1477-7525-12-32
- Jan 1, 2014
- Health and Quality of Life Outcomes
Background The occurrence of response shift (RS) in longitudinal health-related quality of life (HRQoL) studies, reflecting patient adaptation to disease, has already been demonstrated. Several methods have been developed to detect the three types of RS: 1) recalibration RS, 2) reprioritization RS, and 3) reconceptualization RS. We investigated two complementary methods to characterize the occurrence of RS: factor analysis, comprising Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA), and an Item Response Theory (IRT) method. Methods Breast cancer patients (n = 381) completed the EORTC QLQ-C30 and EORTC QLQ-BR23 questionnaires at baseline, immediately following surgery, and three and six months after surgery, according to the “then-test/post-test” design. Recalibration was explored with the then-test method using MCA and an IRT model, the Linear Logistic Model with Relaxed Assumptions (LLRA). PCA was used to explore reconceptualization and reprioritization. Results MCA highlighted the main profiles of recalibration: patients with a high HRQoL level report a slightly worse HRQoL level retrospectively, and vice versa. The LLRA model indicated a downward or upward recalibration for each dimension. At six months, the recalibration effect was statistically significant for 11 of 22 dimensions of the QLQ-C30 and BR23 according to the LLRA model (p ≤ 0.001). Regarding the QLQ-C30, PCA indicated a reprioritization of the symptom scales and a reconceptualization reflected in an increased correlation between the functional scales. Conclusions Our findings demonstrate the usefulness of these analyses in characterizing the occurrence of RS. The MCA and IRT analyses gave results convergent with the then-test method in characterizing the recalibration component of RS. PCA is an indirect method for investigating the reprioritization and reconceptualization components of RS.
- Research Article
80
- 10.1027/2698-1866/a000034
- Feb 1, 2023
- Psychological Test Adaptation and Development
The importance of providing structural validity evidence for test scores derived from psychometric instruments is highlighted by several institutions; for example, the American Psychological Association (2014) demands that evidence for the validity of an instrument's internal structure and its underlying measurement model be provided before it is applied in psychological assessment. Knowledge about the latent structure of the data obtained with a test addresses the major question of what construct(s) are being measured by the psychological test under investigation (Ziegler, 2014, 2020). Structural validity is typically studied with factor analyses when the test scores reflect continuous latent traits. As most submissions to Psychological Test Adaptation and Development (PTAD) deal with the adaptation and further development of existing measures, authors typically test a measurement model that is based on theoretical considerations and prior findings on original versions (or adaptations) of the test under investigation. Our literature review of PTAD's publications showed that more than 90% of the articles contain at least one confirmatory factor analysis (CFA). As editor and reviewers of PTAD, we appreciate that authors are rigorous in providing evidence on the structural validity of their tests' data. However, since PTAD's inception in 2019, one comment has been frequently communicated to authors during the review process: the request to adjust the analytic approach in CFA from maximum likelihood (ML) estimation to the mean- and variance-adjusted weighted least squares (WLSMV; Muthén et al., 1997) estimator, to account for the ordinal nature of the data that psychological instruments typically generate at the item level. In this editorial, we discuss the rationale behind choosing the WLSMV estimator when analyzing test adaptations and developments based on ordinal categorical data, and we concisely illustrate the problems associated with using the ML estimator (potentially in combination with robust tests of model fit) for such data.
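Part of the rationale for WLSMV can be illustrated with a small simulation, shown below; this sketch is not taken from the editorial. Coarsening normal latent responses into a few ordered categories attenuates the Pearson correlations that ML-based linear CFA analyzes, whereas polychoric-based estimators such as WLSMV target the underlying latent correlation.

```python
# A minimal simulation sketch (not from the editorial) of why ML on raw ordinal
# items is problematic: coarsening bivariate normal latent responses into a few
# ordered categories attenuates the Pearson correlation relative to the latent
# correlation that polychoric-based estimators such as WLSMV target.
import numpy as np

rng = np.random.default_rng(1)
rho = 0.6                                           # latent (polychoric) correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
y_star = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

thresholds = [-0.5, 0.5, 1.5]                        # 4 ordered categories, skewed
y_ord = np.digitize(y_star, thresholds)              # ordinal "item responses"

pearson_latent = np.corrcoef(y_star.T)[0, 1]
pearson_ordinal = np.corrcoef(y_ord.T)[0, 1]
print(round(pearson_latent, 3))    # ~0.60
print(round(pearson_ordinal, 3))   # noticeably smaller: attenuation from coarsening
```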
- Research Article
4
- 10.2147/ppa.s269255
- Nov 1, 2020
- Patient Preference and Adherence
Purpose This study aimed to simplify the version-1 Chinese and Western medication adherence scale for patients with chronic kidney disease (CKD) to a version-2 scale using item response theory (IRT) analyses, and to further evaluate the performance of the version-2 scale. Materials and Methods First, we refined the version-1 scale using IRT analyses, examining the discrimination parameter (a), difficulty parameter (b), and maximum information function peak (Imax). The final refinement from the version-1 to the version-2 scale was also decided upon clinical considerations. Second, we analyzed the reliability and validity of the version-2 scale using classical test theory (CTT), as well as the difficulty, discrimination, and Imax of the version-1 and version-2 scales using IRT, in order to conduct the scale evaluation. Results For scale refinement, the 26-item version-1 scale was reduced to a 15-item version-2 scale after IRT analyses. For the scale evaluation using CTT, the internal consistency reliability (total Cronbach α = 0.842) and test–retest reliability (r = 0.909) of the version-2 scale were desirable. Content validity indicated 3 components: knowledge, belief, and behaviors. We found meritorious construct validity, with 3 detected components matching the intended constructs of medication knowledge (items 1–9), medication behavior (items 13–15), and medication belief (items 10–12) based upon exploratory factor analysis. The correlation between the version-2 scale and the Morisky, Green and Levine (MGL) scale was weak (Pearson coefficient = 0.349). For the scale evaluation with IRT, the findings showed enhanced discrimination and decreased difficulty for most retained items (items 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15), decreased Imax for items 1, 2, 3, 4, 6, 11, 14, and increased Imax for items 5, 7, 8, 9, 10, 12, 13, 14, 15 in the version-2 scale compared with the version-1 scale. Conclusion The original Chinese and Western medication adherence scale was refined to a 15-item version-2 scale after IRT analyses. The scale evaluation using CTT and IRT showed that the version-2 scale had desirable reliability, validity, discrimination, difficulty, and information overall. Therefore, the version-2 scale is clinically feasible for assessing the medication adherence of CKD patients.
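The quantities used for the item-level refinement (discrimination a, difficulty b, and the maximum of the information function Imax) are related by the standard 2PL information formula. The sketch below, with hypothetical parameter values rather than the adherence scale's estimates, shows that I(theta) = a^2 * P * (1 - P) peaks at theta = b with Imax = a^2 / 4.

```python
# A minimal sketch of the 2PL item information function I(theta) = a^2 * P * (1 - P),
# whose peak Imax = a^2 / 4 occurs at theta = b; the a, b values are illustrative,
# not the adherence scale's estimates.
import numpy as np

def item_information(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

a, b = 1.6, 0.3                              # hypothetical item parameters
theta = np.linspace(-4, 4, 9)
print(np.round(item_information(theta, a, b), 3))
print(a**2 / 4, item_information(b, a, b))   # Imax equals a^2/4, attained at theta = b
```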