Validity theory has evolved dramatically in the past few decades. The most prominent theory in recent years is the argument-based validity framework proposed by Kane (1992, 2004, 2006). To evaluate test score interpretations and uses under Kane’s framework, test developers first need to state an interpretive argument and then build a validity argument by providing sound warrants for four inferences: (a) scoring, from an observation to an observed score; (b) generalization, from the observed score to the universe score; (c) extrapolation, from the universe score to a target score; and (d) decision, from the target score to test use. In the field of language testing, a number of studies have investigated the validity of test score interpretations and uses, especially for high-stakes tests such as the TOEFL (Chapelle, 2008; Chapelle, Enright, & Jamieson, 2010). However, few studies have validated in-house placement test score interpretations and uses, and no study has evaluated the validity of such low-stakes tests using Kane’s framework. Regardless of whether a test is high or low stakes, test developers are responsible for validating their test score interpretations and uses in order to attest to their validity.

This study uses Kane’s (2006) argument-based validity framework to evaluate the validity of in-house placement test score interpretations and uses. The research questions are as follows: (a) to what extent do examinees answer the placement items correctly, and do high-scoring examinees answer more items correctly; (b) to what extent are the placement items consistently sampled from a domain and sufficient in number to reduce measurement error; (c) to what extent does the difficulty of the placement items match the objectives of the reading course; and (d) to what extent do placement decisions that assign examinees to the appropriate level of the course have an impact on washback in the course?

An in-house placement test comprising a 40-item grammar section, a 40-item vocabulary section, and a 10-item reading section was developed and administered to 428 first-year students at a private university in April 2010. All items were multiple choice so that the answer sheets could be scored easily with a reader. Based on their test scores, about 60 high-scoring students were placed into one of two advanced reading classes and about 50 low-scoring students into one of two basic reading classes; the remaining students were placed into one of several intermediate classes. A 55-item grammar achievement test was administered twice, as a pretest and again as a posttest, to the two basic and two intermediate classes. In addition, a 51-item class evaluation survey was administered to investigate students’ participation in the reading classes and to gauge their satisfaction with the classes and the study support.

The warrant for the scoring inference was based on the results of the item analysis. The warrant for the generalization inference was based on a composite generalizability coefficient of .92. The warrant for the extrapolation inference was based on a FACETS analysis showing that the difficulty estimates of the learning levels were in the expected order. The warrant for the decision inference was based on the basic-level students’ score gains on the achievement test and their positive reactions to the class evaluation survey.
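As an illustration of the kind of item analysis that typically backs a scoring inference, the following Python sketch computes item facility and corrected point-biserial discrimination for a dichotomously scored test. The response matrix, random seed, and dimensions are hypothetical and are not data from the study.

import numpy as np

# Hypothetical response matrix: 428 examinees x 90 items, 1 = correct, 0 = incorrect.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(428, 90))

# Total score per examinee.
total = responses.sum(axis=1)

# Item facility: proportion of examinees answering each item correctly.
facility = responses.mean(axis=0)

def corrected_point_biserial(item, total):
    # Correlate item scores with the total score excluding the item itself.
    rest = total - item
    return np.corrcoef(item, rest)[0, 1]

# Item discrimination for every item.
discrimination = np.array(
    [corrected_point_biserial(responses[:, j], total) for j in range(responses.shape[1])]
)

print(facility[:5], discrimination[:5])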
All of the validity arguments presented in this study support the validity of the placement test score interpretations and uses. However, to further improve the validity of the test score interpretations and uses, it is necessary to investigate the washback effects of the placement test in the reading classes and to revise the test so that the grammar, vocabulary, and reading sections contain 30 items each.
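One way to anticipate how rebalancing the sections to 30 items each might affect section-level reliability is the Spearman-Brown prophecy formula. The sketch below uses hypothetical section reliabilities for illustration only; they are not values reported in the study.

# Spearman-Brown prophecy: predicted reliability when a test is lengthened or
# shortened by a factor k (new length / old length).
def spearman_brown(reliability, k):
    return k * reliability / (1 + (k - 1) * reliability)

# Grammar and vocabulary shortened from 40 to 30 items (k = 0.75);
# reading lengthened from 10 to 30 items (k = 3.0).
# The reliabilities .85 and .60 below are hypothetical.
print(spearman_brown(0.85, 30 / 40))  # grammar or vocabulary section
print(spearman_brown(0.60, 30 / 10))  # reading section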