Abstract

The equating of reasonably parallel forms of College Board Achievement Tests in Biology, Chemistry, Mathematics Level II, American History and Social Studies, and French is discussed in this paper. The results of five equating methods are compared: (1) Tucker, (2) Levine equally reliable, (3) Levine unequally reliable, (4) frequency estimation equipercentile, and (5) chained equipercentile. These methods are used with an internal common‐item anchor‐test data collection design. Three sampling strategies were evaluated: (1) random samples from populations similar in ability level, (2) random samples from populations dissimilar in ability level, and (3) samples from populations dissimilar in ability level that were constructed to be similar in ability level by matching on a covariate, such as the distribution of scores on a set of common items. In all cases, the criterion for comparison was the set of results produced by the Tucker procedure with random samples from populations similar in ability level; these results served as the criterion because they were obtained under the best operational conditions.

The results of the study indicate that it may be difficult, and in some cases impossible, to equate achievement tests using new‐ and old‐form samples drawn from populations that differ in ability level. All equating methods investigated in this study appear to be affected by group differences in ability. The methods that appear to be most affected by these differences are the Tucker and frequency estimation equipercentile procedures; the most robust are the chained equipercentile and the two Levine procedures.

Matching on the basis of observed scores on a set of internal common items does not remedy the situation. In general, matching produces results, particularly scaled‐score means, that are similar across all equating procedures but that over‐ or underestimate the criterion scaled‐score means. Because the results are similar across methods, the effect can be quite misleading: in the absence of a criterion, one could conclude that, because consistent results are obtained across methods, the results are close to “truth.” This was not found to be the case for the situations investigated in this study, and matching cannot be recommended as a procedure for rectifying the problem of sample ability‐level differences.
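To make the chained procedure concrete, the sketch below shows an unsmoothed chained equipercentile equating through an internal anchor: a new-form score is first mapped to an anchor-score equivalent in the new-form sample, and that equivalent is then mapped to an old-form score in the old-form sample. This is a minimal illustration of the general technique, not the operational implementation used in the study; operational equating adds continuization and smoothing, and all variable names and the simulated data here are hypothetical.

```python
import numpy as np

def equipercentile(x, from_scores, to_scores):
    """Map score x on the 'from' form to the 'to'-form score
    with the same percentile rank (unsmoothed sketch)."""
    # Percentile rank of x among the 'from'-form scores (0-100 scale)
    pr = 100.0 * np.mean(np.asarray(from_scores) <= x)
    # 'to'-form score at that rank (inverse empirical CDF)
    return np.percentile(to_scores, pr)

def chained_equipercentile(x, x_new, v_new, v_old, y_old):
    """Chain two links through the internal anchor V."""
    # Link 1: new form X -> anchor V, estimated in the new-form sample
    v = equipercentile(x, x_new, v_new)
    # Link 2: anchor V -> old form Y, estimated in the old-form sample
    return equipercentile(v, v_old, y_old)

# Illustrative use with simulated samples of differing ability
rng = np.random.default_rng(0)
x_new = rng.normal(50, 10, 2000)              # new-form scores
v_new = 0.5 * x_new + rng.normal(0, 4, 2000)  # internal anchor, new group
y_old = rng.normal(55, 10, 2000)              # old-form scores (abler group)
v_old = 0.5 * y_old + rng.normal(0, 4, 2000)  # internal anchor, old group

print(chained_equipercentile(50.0, x_new, v_new, v_old, y_old))
```

Because each link is estimated within a single group, the chain does not require the two groups to be comparable in ability, which is consistent with the study's finding that the chained procedure is among the methods least affected by group differences.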
