Abstract

Low examinee effort is a major threat to valid uses of many test scores. Fortunately, several methods have been developed to detect noneffortful item responses, most of which use response times. To accurately identify noneffortful responses, one must set response-time thresholds separating those responses from effortful ones. While other studies have compared the efficacy of different threshold-setting methods, they typically do so using simulated or small-scale data. When large-scale data are used in such studies, they often are not from a computer-adaptive test (CAT), use only a handful of items, or do not comprehensively examine different threshold-setting methods. In this study, we compare threshold-setting methods using reading test data from 728,923 3rd–8th-grade students in 2,056 schools across the United States who took a CAT consisting of nearly 12,000 items. In so doing, we help provide guidance to developers and administrators of large-scale assessments on the tradeoffs involved in using a given method to identify noneffortful responses.

Highlights

  • An assumption fundamental to the validity of most intended uses of achievement tests is that examinees are providing maximal effort on the test (AERA et al., 2015).

  • Low effort occurs differentially by subgroup, which can bias achievement gap estimates (Soland, 2018a), and often affects students who are disengaging from school (Soland & Kuhfeld, 2019; Soland, Jensen, et al., 2019). In short, measurement is often most impacted among the students for whom it is arguably most important.

  • The CUMP approach produced thresholds for only 5,987 items because nearly 2,000 items had response accuracies that never dipped below the chance rate conditional on response time. This phenomenon likely occurred because the test we used is a computer-adaptive test (CAT), with items targeted at an examinee’s estimated ability level; a sketch of the CUMP idea follows this list.
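To make the highlight above concrete, the following minimal Python sketch shows the general idea behind a CUMP-style (cumulative proportion) threshold for a single item: scan candidate times and keep the largest one at which the proportion correct among faster responses is still at or below chance. The candidate grid, the 4-option chance rate of .25, and all names are illustrative assumptions, not this study's implementation.

    import numpy as np

    def cump_threshold(rts, correct, chance=0.25, grid=None):
        """Illustrative CUMP-style threshold for one item.

        Returns the largest candidate time t at which the proportion
        correct among responses with rt <= t is still at or below the
        chance rate, or None if accuracy never sits at or below chance.
        """
        rts = np.asarray(rts, dtype=float)
        correct = np.asarray(correct, dtype=bool)
        if grid is None:
            grid = np.arange(0.5, 30.5, 0.5)  # candidate times in seconds
        threshold = None
        for t in grid:
            fast = rts <= t
            if not fast.any():
                continue  # no responses this fast yet
            if correct[fast].mean() <= chance:
                threshold = t  # accuracy still at/below chance: extend
            else:
                break  # accuracy has climbed above chance: stop
        return threshold

    # Toy example: two very fast incorrect responses, then accurate ones.
    rts = [0.7, 0.9, 1.2, 8.0, 11.0, 14.0]
    correct = [False, False, True, True, True, True]
    print(cump_threshold(rts, correct))  # -> 1.0 with this toy grid

Note the None case: on a CAT, items sit near each examinee's estimated ability, so even the fastest responses can beat chance. The loop then exits before any threshold is stored, which mirrors the roughly 2,000 items for which no CUMP threshold could be set in this study.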

Introduction

An assumption fundamental to the validity of most intended uses of achievement tests is that examinees are providing maximal effort on the test (AERA et al., 2015). This assumption is often violated, especially on tests with low or no stakes for students (Jensen et al., 2018; Rios et al., 2016; Wise & Kong, 2005; Wise & Kuhfeld, 2020). Schnipke and Scrams (2002) divide test examinees into two categories: those exhibiting “solution behavior” and those exhibiting “rapid-guessing behavior.” Students in the latter category, who respond to a test item without sufficient time to have understood the question, are not engaged with the test during that item (Schnipke & Scrams, 2002; Wise & Kong, 2005). Response times allow one to identify low effort at the item level, unlike other approaches such as person-fit statistics (Wise, 2015).
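As a concrete illustration of item-level flagging, the short Python sketch below applies one widely used rule from the response-time effort literature, a normative threshold of the NT10 type (10% of an item's mean response time, capped at 10 seconds). The toy data, column names, and the choice of this particular rule are illustrative assumptions rather than this study's method.

    import pandas as pd

    # Toy response log: one row per (student, item) pair with a response time.
    log = pd.DataFrame({
        "student": [1, 1, 2, 2, 3, 3],
        "item":    ["a", "b", "a", "b", "a", "b"],
        "rt_sec":  [18.0, 2.1, 25.4, 16.9, 1.2, 14.7],
    })

    # NT10-style rule: flag a response as a rapid guess when its time is
    # below 10% of the item's mean response time, capped at 10 seconds.
    item_mean = log.groupby("item")["rt_sec"].transform("mean")
    log["threshold_sec"] = (0.10 * item_mean).clip(upper=10.0)
    log["rapid_guess"] = log["rt_sec"] < log["threshold_sec"]
    print(log[["student", "item", "rt_sec", "threshold_sec", "rapid_guess"]])

Because the flag is attached to each response rather than to the examinee, downstream handling (for example, filtering or effort-moderated scoring) can operate item by item, which is the advantage over person-level fit statistics noted above.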
