Validity theory has evolved dramatically in the past few decades. The most prominent theory in recent years is the argument-based validity framework proposed by Kane (1992, 2004, 2006). To evaluate test score interpretations and uses under Kane’s framework, test developers first need to state an interpretive argument and then build a validity argument by providing sound warrants for four inferences: (a) scoring, from an observation to an observed score; (b) generalization, from the observed score to the universe score; (c) extrapolation, from the universe score to a target score; and (d) decision, from the target score to test use. In the field of language testing, a number of studies have investigated the validity of test score interpretations and uses, especially for high-stakes tests such as the TOEFL (Chapelle, 2008; Chapelle, Enright, & Jamieson, 2010). However, few studies have validated in-house placement test score interpretations and uses, and no study has evaluated the validity of such low-stakes tests using Kane’s framework. Regardless of whether a test is high or low stakes, test developers are responsible for validating their test score interpretations and uses in order to attest to their validity.

This study uses Kane’s (2006) argument-based validity framework to evaluate the validity of in-house placement test score interpretations and uses. The research questions are as follows: (a) to what extent do examinees answer the placement items correctly, and do high-scoring examinees answer more items correctly; (b) to what extent are the placement items consistently sampled from a domain and sufficient in number to reduce measurement error; (c) to what extent does the difficulty of the placement items match the objectives of the reading course; and (d) to what extent do placement decisions that assign examinees to the appropriate level of the course have an impact on washback in the course?

An in-house placement test comprising a 40-item grammar section, a 40-item vocabulary section, and a 10-item reading section was developed and administered to 428 first-year students at a private university in April 2010. All items were multiple choice so that the answer sheets could be scored easily with a reader. Based on their test scores, about 60 high-scoring students were placed into one of two advanced reading classes and about 50 low-scoring students into one of two basic reading classes; the remaining students were placed into one of several intermediate classes. A 55-item grammar achievement test was administered twice, as a pretest and again as a posttest, to the two basic and two intermediate classes. In addition, a 51-item class evaluation survey was administered to investigate students’ participation in the reading classes and to gauge their satisfaction with the classes and the study support.

The warrant for the scoring inference was based on the results of the item analysis. The warrant for the generalization inference was based on a composite generalizability coefficient of .92. The warrant for the extrapolation inference was based on a FACETS analysis showing that the difficulty estimates of the learning levels were in the expected order. The warrant for the decision inference was based on the basic-level students’ score gains on the achievement test and their positive reactions to the class evaluation survey.
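As an illustration of the kind of item analysis that typically backs a scoring inference, the following Python sketch computes item facility and corrected point-biserial discrimination for a dichotomously scored test. The response matrix, random seed, and dimensions are hypothetical and are not data from the study.

import numpy as np

# Hypothetical response matrix: 428 examinees x 90 items, 1 = correct, 0 = incorrect.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(428, 90))

# Total score per examinee.
total = responses.sum(axis=1)

# Item facility: proportion of examinees answering each item correctly.
facility = responses.mean(axis=0)

def corrected_point_biserial(item, total):
    # Correlate item scores with the total score excluding the item itself.
    rest = total - item
    return np.corrcoef(item, rest)[0, 1]

# Item discrimination for every item.
discrimination = np.array(
    [corrected_point_biserial(responses[:, j], total) for j in range(responses.shape[1])]
)

print(facility[:5], discrimination[:5])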
All of the validity arguments presented in this study support the validity of the placement test score interpretations and uses. However, to further improve the validity of the test score interpretations and uses, it is necessary to investigate the washback effects of the placement test in the reading classes and to revise the test so that the grammar, vocabulary, and reading sections contain 30 items each.
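One way to anticipate how rebalancing the sections to 30 items each might affect section-level reliability is the Spearman-Brown prophecy formula. The sketch below uses hypothetical section reliabilities for illustration only; they are not values reported in the study.

# Spearman-Brown prophecy: predicted reliability when a test is lengthened or
# shortened by a factor k (new length / old length).
def spearman_brown(reliability, k):
    return k * reliability / (1 + (k - 1) * reliability)

# Grammar and vocabulary shortened from 40 to 30 items (k = 0.75);
# reading lengthened from 10 to 30 items (k = 3.0).
# The reliabilities .85 and .60 below are hypothetical.
print(spearman_brown(0.85, 30 / 40))  # grammar or vocabulary section
print(spearman_brown(0.60, 30 / 10))  # reading section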