The process of rating written language performance is still not well understood, despite a body of work investigating this issue over the last decade or so (e.g., Cumming, 1990; Huot, 1990; Vaughan, 1991; Weigle, 1994a; Milanovic et al., 1996). The purpose of this study is to investigate the process by which raters of texts written by ESL learners make their scoring decisions using an analytic rating scale designed for multiple test forms. The context is the Special Test of English Proficiency (step), which is used by the Australian government to assist in immigration decisions. Four trained, experienced, and reliable step raters took part in the study, providing scores for two sets of 24 texts. The first set was scored as in an operational rating session; raters then provided think-aloud protocols describing the rating process as they rated the second set. A coding scheme developed to describe the think-aloud data allowed analysis of the sequence of rating, the raters' interpretations of the scoring categories in the analytic rating scale, and the difficulties they faced in rating. The data show that although raters follow a fundamentally similar three-stage rating process, the relationship between scale contents and text quality remains obscure. The study demonstrates that the raters' task is to reconcile their impression of the text, the specific features of the text, and the wordings of the rating scale, thereby producing a set of scores. Because the rules and the scale do not cover all eventualities, raters develop various strategies to cope with problematic aspects of the rating process. In doing this, they try to remain close to the scale, but they are also heavily influenced by the complex intuitive impression of the text formed when they first read it. This sets up a tension between the rules and the intuitive impression, which raters resolve through what is ultimately a somewhat indeterminate process. Despite this tension and indeterminacy, rating can yield consistent scores provided raters are supported by adequate training, with additional guidelines to help them deal with problems. Rating requires such constraining procedures to produce reliable measurement.