Misclassification Produced by Rapid-Guessing Identification Methods and Their Suitability Under Various Conditions.
Response Time Threshold Methods (RTTMs) are widely used to identify rapid-guessing behavior (RG) in low-stakes assessments, yet face two key challenges: (a) inevitable misclassifications due to overlapping response time distributions of engaged and disengaged responses, and (b) lack of agreement on which method to use under varying conditions. This simulation study evaluated five RTTMs. Item responses and response times were generated from either a one-component model without RG or a two-component mixture model with RG in the population. Distribution, item, and person parameters were varied. Results showed that when the population contained RG, the mixture lognormal distribution-based method (MLN) was the most robust approach and estimated precise thresholds closest to the time points at which the misclassification rates were minimized, even when bimodality was more difficult to detect. The cumulative proportion method (CUMP) was less robust but also accurate when successful, though less precise. In addition, when the population did not include RG, CUMP was the only method to set thresholds for a notable proportion of cases. The methods were generally more conservative than liberal, though the mixture response time quantile method (MRTQ) was neither. The results are discussed in the light of prior RG research and the methods' characteristics, and future directions are suggested. Ultimately, for practical settings, we recommend a six-step process for RG identification that utilizes both a mixture modeling approach (MLN or MRTQ) and the CUMP method.
- Research Article
5
- 10.1186/s40536-023-00158-8
- Mar 21, 2023
- Large-scale Assessments in Education
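The idea behind a mixture-lognormal threshold can be sketched in a few lines. The block below is only an illustration, not the MLN implementation evaluated in the study: it fits a two-component normal mixture to log response times with a plain EM loop and places the threshold where the two weighted densities cross. The function names, the crossing criterion, the starting values, and the simulated parameters (10% rapid guesses centred near 2 s, engaged responses near 30 s) are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(rts, n_iter=100):
    """EM for a two-component normal mixture fitted to log response times.

    Returns [[weight, mean, sd], [weight, mean, sd]] sorted by mean, so the
    first entry is the faster (putative rapid-guessing) component.
    """
    logs = sorted(math.log(t) for t in rts)
    n = len(logs)
    cut = max(2, n // 4)  # crude start: fastest quarter vs. the rest
    params = []
    for part in (logs[:cut], logs[cut:]):
        mu = sum(part) / len(part)
        var = sum((x - mu) ** 2 for x in part) / len(part)
        params.append([len(part) / n, mu, max(math.sqrt(var), 1e-3)])
    for _ in range(n_iter):
        # E-step: posterior probability that each response is "fast".
        resp = []
        for x in logs:
            d0 = params[0][0] * normal_pdf(x, params[0][1], params[0][2])
            d1 = params[1][0] * normal_pdf(x, params[1][1], params[1][2])
            resp.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        # M-step: weighted updates of weight, mean, and sd.
        for k in (0, 1):
            w = [r if k == 0 else 1.0 - r for r in resp]
            sw = max(sum(w), 1e-12)
            mu = sum(wi * x for wi, x in zip(w, logs)) / sw
            var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, logs)) / sw
            params[k] = [sw / n, mu, max(math.sqrt(var), 1e-3)]
    return sorted(params, key=lambda p: p[1])


def crossing_threshold(params):
    """Response time (original scale) where the weighted densities cross.

    Returns None when no crossing lies between the two component means,
    which this sketch treats as "no threshold could be set".
    """
    (w0, m0, s0), (w1, m1, s1) = params

    def gap(x):
        return w0 * normal_pdf(x, m0, s0) - w1 * normal_pdf(x, m1, s1)

    lo, hi = m0, m1
    if gap(lo) <= 0 or gap(hi) >= 0:
        return None
    for _ in range(60):  # bisection on the log scale
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))


# Hypothetical data for one item: 100 rapid guesses, 900 engaged responses.
random.seed(1)
rts = [random.lognormvariate(math.log(2.0), 0.3) for _ in range(100)]
rts += [random.lognormvariate(math.log(30.0), 0.5) for _ in range(900)]
params = fit_two_component_mixture(rts)
threshold = crossing_threshold(params)
print("estimated threshold (s):", threshold)
```

Responses faster than the returned threshold would be flagged as rapid guesses. Because the two distributions overlap, some misclassification remains on both sides of any threshold, which is the first challenge the abstract names.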
Understanding the cognitive processes, skills, and strategies that examinees use in testing is important for construct validity and score interpretability. Although response process evidence has long been included as an important aspect of validity (e.g., Standards for Educational and Psychological Testing, 1999), relevant studies are often lacking, especially in large-scale educational and psychological testing. An important method for studying response processes involves explanatory mathematical modeling of item responses and item response times from variables that represent sources of cognitive complexity. For many item types, examinees may differ in the strategies they apply when responding to items. Mixture class item response theory models can identify latent classes of examinees with different processes, skills, and strategies based on their pattern of item responses. This study will illustrate the use of response times in conjunction with explanatory item response theory models and mixture models to provide information relevant to test validity and, hence, to score interpretations.
- Research Article
66
- 10.1111/jedm.12060
- Mar 1, 2015
- Journal of Educational Measurement
The assumption of conditional independence between the responses and the response times (RTs) for a given person is common in RT modeling. However, when the speed of a test taker is not constant, this assumption will be violated. In this article we propose a conditional joint model for item responses and RTs, which incorporates a covariance structure to explain the local dependency between speed and accuracy. To obtain information about the population of test takers, the new model was embedded in the hierarchical framework proposed by van der Linden (). A fully Bayesian approach using a straightforward Markov chain Monte Carlo (MCMC) sampler was developed to estimate all parameters in the model. The deviance information criterion (DIC) and the Bayes factor (BF) were employed to compare the goodness of fit between the models with two different parameter structures. The Bayesian residual analysis method was also employed to evaluate the fit of the RT model. Based on the simulations, we conclude that (1) the new model noticeably improves the parameter recovery for both the item parameters and the examinees’ latent traits when the assumptions of conditional independence between the item responses and the RTs are relaxed and (2) the proposed MCMC sampler adequately estimates the model parameters. The applicability of our approach is illustrated with an empirical example, and the model fit indices indicated a preference for the new model.
- Research Article
8
- 10.1111/bmsp.12320
- Sep 5, 2023
- British Journal of Mathematical and Statistical Psychology
The use of joint models for item scores and response times is becoming increasingly popular in educational and psychological testing. In this paper, we propose two new person-fit statistics for such models in order to detect aberrant behaviour. The first statistic is computed by combining two existing person-fit statistics: one for the item scores, and one for the item response times. The second statistic is computed directly using the likelihood function of the joint model. Using detailed simulations, we show that the empirical null distributions of the new statistics are very close to the theoretical null distributions, and that the new statistics tend to be more powerful than several existing statistics for item scores and/or response times. A real data example is also provided using data from a licensure examination.
- Dissertation
- 10.17077/etd.005181
- Dec 1, 2019
Nowadays it is not uncommon for tests, especially high-stakes assessments, to be administered with time constraints. When a test is constructed to assess examinees’ abilities in academic knowledge, but the imposed time limits affect examinees’ test performance, speededness effects become a concern. Under such circumstances, inaccurate psychometric results and inferences might be drawn if unidimensional item response theory (IRT) models are applied in testing practice. Speededness detection methods have been proposed to identify speeded responses/examinees. The purpose of the study was therefore to comprehensively investigate how various detection methods combined with various calibration treatments compared in reducing speededness effects under the 2PL and 3PL IRT models with: (1) simulated test data under various speededness conditions, and (2) real test data. Two simulation studies were conducted. For the first simulation study, two main factors were investigated: (1) degree of speededness (three levels: None, 10%, and 25%), and (2) IRT calibration model (two models: 2PL and 3PL). The performance of the various combinations of detection methods and calibration treatments was evaluated by assessing Pearson correlation, item parameter recovery, and model-data fit statistics. Data generated in the second simulation study were based on the estimated person and item parameter values obtained from IRT model calibration of the real data used in this study. Thus, the second simulation study served as a link between the pure simulation study and the real data study, because such a generation process enabled the simulated dataset to carry some characteristics of the real data, while true parameter values were known. The real data came from a large pool of items from a high-stakes standardized assessment.
In the current study, treating the identified speeded responses as “not-presented” always led to more accurate psychometric results than the other calibration treatments across the speededness levels under both the 2PL and 3PL IRT models. When the speededness level was large, “removing speeded examinees” usually yielded results comparable to those of the “not-presented” treatment across different detection methods, and is a feasible and easily applied option in practice. In addition, detection methods using the item response time (RT) distribution as a speededness indicator (i.e., the INSPECT and VITP methods in the current study) generally performed better than the other detection methods in dealing with speededness effects. Moreover, including the c-parameter handled the rapid-guessing strategy well. Thus, when the speededness level was not large and was mainly caused by rapid-guessing behavior, “no treatment” under the 3PL IRT model yielded accurate psychometric results. The findings of the current study provide several feasible options for practitioners when speededness is a concern and unidimensional IRT models are used in the calibration or scoring process. It is hoped that this study will inspire researchers and practitioners to develop new detection methods, or ways of dealing with speededness effects under unidimensional IRT models.
- Supplementary Content
- 10.4225/03/58a5175a22730
- Feb 16, 2017
- Figshare
This dissertation explores an application of finite mixture modelling to self-assessed health (SAH) survey data in the British Household Panel Survey (BHPS), and then considers tests for homogeneity in some examples of finite mixture models. In the application of finite mixture modelling to SAH survey data, the problem studied is how different question wording and response items in the survey question may affect SAH responses. While the usual methods in the literature implicitly assume that all respondents react to the change in response items in a certain manner, a latent class model is introduced that relaxes this assumption. Results show that this latent class model reduces misinformation that may be introduced using the usual methods in the literature. The estimated effect of question wording and response items can potentially be used to predict SAH responses to different SAH questions. The latent class model is one example of finite mixture models, and while the application of the latent class model in the SAH question seems a good fit to the data in various aspects, the problem of whether different latent classes exist in the first place needs to be further explored. In the setting of finite mixture modelling, this is known as testing for homogeneity. The rest of the thesis explores testing for homogeneity in two other examples: the zero-inflated Poisson (ZIP), and the two-component finite mixture model. Testing for homogeneity in finite mixture models is a well-studied statistical problem. While many other studies have focused on deriving the relevant non-standard null distributions of test statistics, a different approach is considered here. By considering alternative models that are close in some sense to the finite mixture models, simple tests can be constructed for which the null distributions of test statistics are known, and which may also have power when the true data generating processes are the finite mixture models. 
For testing against the ZIP, the alternative model constructed is one that shares similar characteristics to the ZIP and the hurdle Poisson (HP) models. For testing against the two-component finite mixture model, the construction of the alternative model is done by means of a Gram-Charlier expansion. Simulation results show that this approach performs well in terms of size and power for both the ZIP and the two-component finite mixture data generating processes.
- Research Article
111
- 10.1111/bmsp.12114
- Sep 5, 2017
- British Journal of Mathematical and Statistical Psychology
To provide more refined diagnostic feedback with collateral information in item response times (RTs), this study proposed joint modelling of attributes and response speed using item responses and RTs simultaneously for cognitive diagnosis. For illustration, an extended deterministic input, noisy 'and' gate (DINA) model was proposed for joint modelling of responses and RTs. Model parameter estimation was explored using the Bayesian Markov chain Monte Carlo (MCMC) method. The PISA 2012 computer-based mathematics data were analysed first. These real data estimates were treated as true values in a subsequent simulation study. A follow-up simulation study with ideal testing conditions was conducted as well to further evaluate model parameter recovery. The results indicated that model parameters could be well recovered using the MCMC approach. Further, incorporating RTs into the DINA model would improve attribute and profile correct classification rates and result in more accurate and precise estimation of the model parameters.
- Research Article
5
- 10.3389/feduc.2020.607260
- Jan 7, 2021
- Frontiers in Education
Open (open-book) online assessment has become a widely used tool in higher education for monitoring learning progress and teaching effectiveness. It has been gaining popularity because it is flexible to use and makes response behavior data available for researchers to study response processes. However, analyzing these data raises several challenges, such as how to handle outlying response times, how to make use of the information from item response order, how item response time, response order, and item scores are related, and how to help classroom teachers quickly check whether student responses are aligned with the design of the assessment. The purposes of this study are threefold: (1) to provide a solution for handling outlying response times that arise from the design of open online formative assessments (i.e., ample or unrestricted testing time), (2) to propose a new measure for investigating the item response order, and (3) to discuss two analytical approaches that are useful for studying response behaviors: data visualization and the Bayesian generalized linear mixed effects model (B-GLMM). An application of these two approaches is illustrated using open online quiz data. Our findings from the B-GLMM showed that item response order was related to item response time, but not to item scores, and that item response time was related to item scores, but its effects were moderated by the cognitive level. Additionally, the findings from the B-GLMM and data visualization were consistent, which helped instructors see the alignment of student responses with the assessment design.
- Research Article
- 10.1111/insr.12462
- Jul 5, 2021
- International Statistical Review
In many applications of two-component mixture models such as the popular zero-inflated model for discrete-valued data, it is customary for the data analyst to evaluate the inherent heterogeneity in view of observed data. To this end, the score test, acclaimed for its simplicity, is routinely performed. It has long been recognized that this test may behave erratically under model misspecification, but the implications of this behavior remain poorly understood for popular two-component mixture models. For the special case of zero-inflated count models, we use data simulations and theoretical arguments to evaluate this behavior and discuss its implications in settings where the working model is restrictive with regard to the true data generating mechanism. We enrich this discussion with an analysis of count data in HIV research, where a one-component model is shown to fit the data reasonably well despite apparent extra zeros. These results suggest that a rejection of homogeneity does not imply that the underlying mixture model is appropriate. Rather, such a rejection simply implies that the mixture model should be carefully interpreted in the light of potential model misspecifications, and further evaluated against other competing models.
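For readers unfamiliar with the test discussed above, here is a minimal sketch of a score test for zero inflation in an intercept-only Poisson model. It uses the closed-form statistic usually attributed to van den Broek (1995); that this is the exact variant the abstract has in mind is an assumption, as are the function name and the toy data. Under the Poisson null the statistic is referred to a chi-square distribution with one degree of freedom.

```python
import math


def zip_score_test(counts):
    """Score test of Poisson vs. zero-inflated Poisson (no covariates).

    Returns (statistic, p_value); the p-value uses the chi-square(1)
    reference distribution.
    """
    n = len(counts)
    mean = sum(counts) / n  # Poisson MLE of the rate under the null
    if mean <= 0:
        raise ValueError("all counts are zero; the Poisson mean is not estimable")
    p0 = math.exp(-mean)  # expected share of zeros under the null
    n0 = sum(1 for y in counts if y == 0)  # observed number of zeros
    numerator = (n0 - n * p0) ** 2
    denominator = n * p0 * (1.0 - p0) - n * mean * p0 ** 2
    stat = numerator / denominator
    # Upper tail of chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data (hypothetical): many more zeros than a Poisson with mean 1 implies.
excess_zeros = [0] * 60 + [2] * 20 + [3] * 20
# Toy data (hypothetical): frequencies close to a Poisson with mean 1.
near_poisson = [0] * 37 + [1] * 37 + [2] * 18 + [3] * 6 + [4] * 2

stat_a, p_a = zip_score_test(excess_zeros)
stat_b, p_b = zip_score_test(near_poisson)
print(stat_a, p_a)
print(stat_b, p_b)
```

The statistic only compares the observed and expected number of zeros under a working Poisson model, which is why, as the abstract cautions, a rejection says little about whether the zero-inflated mixture itself is the right alternative.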
- Research Article
5
- 10.1016/s0378-3758(96)00080-8
- Nov 1, 1996
- Journal of Statistical Planning and Inference
On the conditional and mixture model approaches for matched pairs
- Research Article
5
- 10.1186/s40536-023-00179-3
- Sep 1, 2023
- Large-scale Assessments in Education
Unfavorable test-taking behaviors, such as speededness and disengagement, have long been a validity concern for large-scale low-stakes assessments. Understanding the presence and extent of such behaviors is important for ensuring the validity of inferences based on test scores. This study examined test-taking behaviors using item response time (RT), a process data-derived variable from the TIMSS 2019 database. Analyses compared the United States to three other countries (England, Singapore, and the United Arab Emirates) that administered the digital version of TIMSS (eTIMSS) 2019 in English at grade 8. Test-taking behaviors were identified within each country and compared within and across countries. Specifically, to identify distinct types of test-taking behaviors, mixture modeling was employed on RT and item scores from Booklet 1, Part 1, of the eTIMSS 2019 eighth-grade assessment. The results indicated that each country had several latent classes of students with different pacing trajectories and performance. The test-taking behaviors of these latent classes were labeled as Steady; Disengaged or Very disengaged; Speeded or Very speeded; and Efficient and high-performing. Most of the students in each country had a Steady pace (medium to high sum score; steady RT throughout the test): 71% in England, 74% in both Singapore and the United Arab Emirates, and 84% in the United States. Disengaged or Very disengaged students (low sum score; short RT) were identified in each country but were more prevalent in England and the United Arab Emirates (above 20% in both) than in the United States and Singapore (both below 10%). The study also revealed small percentages of Speeded or Very speeded students (low to medium sum score; long RT at first but very short RT toward the end) in England, the United Arab Emirates, and the United States (1%, 5%, and 6%, respectively) but not in Singapore. 
A unique class of Efficient and high-performing students (high sum score; short RT) was identified only in Singapore (24%). This study demonstrated that mixture modeling is a useful technique for identifying distinct test-taking behaviors and highlighted the presence and extent of unfavorable test-taking behaviors within each selected country using data from Booklet 1, Part 1, of the eTIMSS 2019 eighth-grade assessment.
- Research Article
- 10.1177/01466216251316277
- Feb 2, 2025
- Applied Psychological Measurement
Response time (RT) has been an essential resource for supplementing the estimation accuracy of latent traits and item parameters in educational testing. Most item response theory (IRT) approaches are based on parametric RT models. However, since test takers may alter their behaviors during a test due to motivation or strategy shifts, fatigue, or other causes, parametric IRT models are unlikely to capture such subtle and nonlinear information. In this work, we propose a novel semi-parametric IRT model with O'Sullivan splines to accommodate the flexible mean RT shapes and explore the underlying nonlinear relationships between latent traits and RT. A simulation study was conducted to demonstrate the substantial improvement in parameter estimation achieved by the new model, as well as the detriment of using parametric models in terms of biases and measurement errors. Using this model, a dataset of mathematics test scores and RT from the Programme for International Student Assessment was analyzed to demonstrate the evident nonlinearity and to compare the proposed model with existing models in terms of model fitting. The findings presented in this study indicate the promising nature of the new approach, suggesting its potential as an additional psychometric tool to enhance test reliability and reduce measurement errors.
- Research Article
2
- 10.3390/jintelligence12020023
- Feb 16, 2024
- Journal of Intelligence
Many recent studies have examined conditional dependence between response accuracy and response times in cognitive tests. While most previous research has focused on revealing a general pattern of conditional dependence for all respondents and items, it is plausible that the pattern varies across respondents and items. In this paper, we attend to this potential heterogeneity and examine the item and person specificities involved in the conditional dependence between item responses and response times. To this end, we use a latent space item response theory (LSIRT) approach with an interaction map that visualizes conditional dependence in response data in the form of item-respondent interactions. We incorporate response time information into the interaction map by applying LSIRT models to slow and fast item responses. Through empirical illustrations with three cognitive test datasets, we confirm the presence and patterns of conditional dependence between item responses and response times, a result consistent with previous studies. Our results further illustrate the heterogeneity in the conditional dependence across respondents, which provides insights into individuals' underlying item-solving processes in cognitive tests. Some practical implications of the results and the use of interaction maps in cognitive tests are discussed.
- Research Article
16
- 10.3758/s13428-019-01302-5
- May 11, 2020
- Behavior Research Methods
The two-alternative multidimensional forced-choice measurement of personality has attracted researchers' attention for its tolerance to response bias. Moreover, the response time can be collected along with the item response when personality measurement is conducted with computers. In view of this situation, the objective of this study is to propose a Thurstonian D-diffusion item response theory (IRT) model, which combines two key existing frameworks: the Thurstonian IRT model for forced-choice measurement and the D-diffusion IRT model for the response time in personality measurement. The proposed model reflects the psychological theories behind the data-generating mechanism of the item response and response time. A simulation study reveals that the proposed model can successfully recover the parameters and factor structure in typical application settings. A real data application reveals that the proposed model estimates similar but still different parameter values compared to the original Thurstonian IRT model, and this difference can be explained by the response time information. In addition, the proposed model successfully reflects the distance-difficulty relationship between the response time and the latent relative respondent position.
- Research Article
13
- 10.1177/00131644221136142
- Nov 16, 2022
- Educational and Psychological Measurement
Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.
- Research Article
2
- 10.21031/epod.1398317
- Oct 26, 2024
- Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi
This study aims to explore the relationship between students' response times, item characteristics, and the effort invested during the Programme for International Student Assessment (PISA) 2015 and 2018 cycles. Through the analysis of data obtained from 69 mathematics trend items administered in a computer-based format across both cycles, this research investigates the dynamics of students' response times and their implications for effort and item characteristics. Findings reveal a significant increase in students' mean response times in the 2018 cycle compared to 2015, indicating potentially heightened effort and solution behavior. Notably, item format exerted a substantial influence on response times, with open-ended items consistently eliciting longer response times than multiple-choice items. Additionally, a correlation between response times and item difficulty emerged, suggesting that more challenging items tend to consume more time, possibly due to the complexity of the cognitive processes involved. Item-based effort, assessed through Response Time Fidelity (RTF) indices, showed that a majority of students exhibited solution behavior on the items across both cycles. Moreover, a decrease in the proportion of students displaying rapid-guessing behavior was observed in the 2018 cycle, potentially reflecting increased engagement with the assessment. While providing insights into the interplay of response times, item characteristics, and effort, this study emphasizes the need for further exploration into the multifaceted nature of effort in educational assessments. Overall, this research contributes valuable perspectives on nuances surrounding test performance and effort evaluation within PISA mathematics assessments.
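As a rough illustration of an item-level effort index of this kind, the sketch below computes, for each item, the proportion of responses classified as solution behavior, which is how response time fidelity is commonly defined. The classification rule (a response counts as solution behavior when its time is at or above an item threshold), the threshold values, the item labels, and the data are all hypothetical and are not taken from the study.

```python
def response_time_fidelity(rt_by_item, thresholds):
    """Proportion of responses per item at or above that item's threshold.

    rt_by_item: dict mapping item id -> list of response times (seconds).
    thresholds: dict mapping item id -> rapid-guessing threshold (seconds).
    """
    rtf = {}
    for item, rts in rt_by_item.items():
        # Responses at or above the threshold count as solution behavior.
        solution = sum(1 for t in rts if t >= thresholds[item])
        rtf[item] = solution / len(rts)
    return rtf


# Hypothetical response times for two items and hypothetical 5-second thresholds.
rt_by_item = {
    "item_a": [1.2, 14.0, 22.5, 30.1],
    "item_b": [40.0, 55.2, 3.0, 2.1, 61.7],
}
thresholds = {"item_a": 5.0, "item_b": 5.0}

rtf = response_time_fidelity(rt_by_item, thresholds)
print(rtf)
```

One minus each value gives the share of responses flagged as rapid guesses on that item, the quantity the abstract reports as declining between the 2015 and 2018 cycles.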