The advantages of a rule-assessment approach to the interpretation of achievement test results have been demonstrated using an S-P chart with coded error types. The problems of similar total test scores resulting from entirely different misapprehensions, as well as of correct answers produced by incorrect rules of operation, were addressed using a simulated data set. Although the overall quality of the test used here, as measured by conventional psychometric indices, proved satisfactory, it was shown that the traditional interpretation, which relies on total test scores, can be misleading, especially when adaptive remediation is sought. It is well known in the medical sciences that a disease may present several symptoms, yet several diseases can share the same symptom (e.g., high fever). Consequently, no responsible physician would prescribe the same medicine to two patients suffering from different diseases merely because both present high fever among their symptoms. Similarly, when two students with different misapprehensions obtain the same total test score, should the teacher prescribe the same remediation to correct their misapprehensions? Although the method of diagnostic test construction was beyond the scope of this paper, it should be noted that test design is a crucial matter that ultimately determines the quality of the diagnosis. The items must therefore be chosen carefully so as to maximize the information about the rules of operation underlying the students' responses. A task specification chart (Birenbaum & Shaw, 1985) may serve as a useful tool in the process of test construction. As the chart illustrated, when an item yields the same result under various "bugs", its contribution to rule assessment is questionable. Although in reality test results are contaminated by noise arising from careless errors or from strategy changes during the test, the overall identification rate achieved by diagnostic tests ranges between 70% and 80% (Tatsuoka, 1984). Similarly, current AI diagnostic systems such as DEBUGGY and DPF are reported to identify 80%–90% of student errors (VanLehn, 1981; Ohlsson & Langley, 1985). Such a rate seems to justify the tedious work involved in constructing a diagnostic tool.
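The central diagnostic point, that students applying different erroneous rules can earn identical total scores while differing item by item, can be shown with a minimal sketch. The Python snippet below is illustrative only and is not taken from the paper: the four subtraction items and the two hypothetical "buggy" rules (subtract the smaller digit from the larger in every column; write 0 whenever borrowing would be needed) are assumptions in the spirit of the whole-number subtraction bugs discussed in this literature. Both rules yield the same total score on these items, and both happen to answer one item correctly despite being wrong procedures, yet their response patterns differ.

```python
# Illustrative sketch (hypothetical items and bugs, not the paper's data):
# two different erroneous subtraction rules, same total test score.

items = [(52, 17), (40, 26), (63, 8), (75, 75)]  # (minuend, subtrahend)

def correct(m, s):
    return m - s

def bug_smaller_from_larger(m, s):
    # Bug: in each column, subtract the smaller digit from the larger one.
    result, place = 0, 1
    while m > 0 or s > 0:
        dm, ds = m % 10, s % 10
        result += abs(dm - ds) * place
        m, s, place = m // 10, s // 10, place * 10
    return result

def bug_zero_when_borrow_needed(m, s):
    # Bug: write 0 in any column that would require borrowing.
    result, place = 0, 1
    while m > 0 or s > 0:
        dm, ds = m % 10, s % 10
        result += (dm - ds if dm >= ds else 0) * place
        m, s, place = m // 10, s // 10, place * 10
    return result

for name, rule in [("smaller-from-larger", bug_smaller_from_larger),
                   ("zero-when-borrow", bug_zero_when_borrow_needed)]:
    responses = [rule(m, s) for m, s in items]
    total = sum(r == correct(m, s) for r, (m, s) in zip(responses, items))
    print(name, responses, "total score:", total)
```

Running the sketch prints two different response vectors ([45, 26, 65, 0] versus [40, 20, 60, 0]) with the same total score of 1, which is exactly the ambiguity that a total-score interpretation cannot resolve but an item-level rule assessment can.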