Cognitive screening tests and items have been found to perform differently across groups that differ in education, ethnicity, and race. Despite the profound implications that such bias holds for studies in the epidemiology of dementia, little research has been conducted in this area. Using the methods of modern psychometric theory (in addition to those of classical test theory), we examined the performance of the Attention subscale of the Mattis Dementia Rating Scale. Several item response theory (IRT) models were compared, including the two- and three-parameter dichotomous logistic models and a polytomous response model. (Log-likelihood ratio tests showed that the three-parameter model was not an improvement over the two-parameter model.) Data were collected as part of the ten-study National Institute on Aging Collaborative investigation of special dementia care in institutional settings. The subscale's KR-20 reliability estimate for this sample was 0.92. IRT model-based reliability estimates, computed at several points along the latent attribute, ranged from 0.65 to 0.97; the measure was least precise at the less disabled tail of the distribution. Most items performed similarly across education groups; the item characteristic curves were almost identical, indicating little or no differential item functioning (DIF). However, four items were problematic. One item (digit span backwards) demonstrated a large error term in the confirmatory factor analysis; item-fit chi-square statistics obtained with BIMAIN confirmed this result for the IRT models. Further, the discrimination parameter for that item was low for all education subgroups, and persons with the highest education generally had a greater probability of passing the item at most levels of theta. Model-based tests of DIF using MULTILOG identified three other items with significant, albeit small, DIF. One item, for example, showed non-uniform DIF: at the impaired tail of the latent distribution, persons with higher education had a higher probability of responding correctly than did lower education groups, whereas at less impaired levels the pattern reversed. A second detection method also flagged this item (unsigned area statistics = 3.05, p < 0.01, and 2.96, p < 0.01). On average, across the entire score range, the lower education group's probability of answering the item correctly was 0.11 higher than the higher education group's. A cross-validation with larger subgroups confirmed the overall result of little DIF for this measure. The methods used here to detect differential item functioning (which may, in turn, be indicative of bias) were applied to a neuropsychological subtest; they have been used previously to examine bias in screening measures across education, ethnic, and racial subgroups. Beyond the epidemiological application of ensuring that screening measures and neuropsychological tests used in diagnosis are free of bias, so that more culture-fair classifications result, these methods are also useful for examining site differences in large multi-site clinical trials. We recommend that these methods receive wider attention in the medical statistical literature.
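For readers less familiar with the models compared above, the following expressions use standard IRT notation (they are not drawn from the paper itself). The two-parameter logistic (2PL) model gives each item a discrimination parameter \(a_i\) and a difficulty parameter \(b_i\); the three-parameter (3PL) model adds a lower-asymptote (pseudo-guessing) parameter \(c_i\):

\[
P_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]} \qquad \text{(2PL)}
\]

\[
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-a_i(\theta - b_i)]} \qquad \text{(3PL)}
\]

Because the 2PL is the 3PL with every \(c_i = 0\), the models are nested, and the log-likelihood ratio statistic \(G^2 = -2(\ln L_{\mathrm{2PL}} - \ln L_{\mathrm{3PL}})\) can be referred to a chi-square distribution with degrees of freedom equal to the number of \(c_i\) parameters added; a non-significant \(G^2\) supports the abstract's conclusion that the 3PL was not an improvement.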
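The unsigned area statistic reported above indexes the area between two groups' item characteristic curves; where the curves cross, as in the non-uniform DIF example, the signed area can be small even when DIF is present, which is why the unsigned version is used. Below is a minimal, hypothetical Python sketch that approximates this index numerically for a 2PL item. It omits the closed-form version of the statistic and its significance test, and the parameter values are invented for illustration, not taken from the paper.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def unsigned_area(a_ref, b_ref, a_foc, b_foc, lo=-4.0, hi=4.0, n=2001):
    """Unsigned area between reference- and focal-group ICCs,
    approximated with the trapezoid rule over theta in [lo, hi].
    Larger values suggest more DIF; crossing curves correspond to
    non-uniform DIF of the kind described in the abstract."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_2pl(theta, a_ref, b_ref) - icc_2pl(theta, a_foc, b_foc))
    return np.sum((gap[1:] + gap[:-1]) * np.diff(theta) / 2.0)

# Hypothetical group-specific item parameters chosen so the curves cross
# (illustration only; these are not the paper's estimates):
print(round(unsigned_area(a_ref=1.5, b_ref=0.0, a_foc=0.8, b_foc=-0.3), 3))
```

In practice the group-specific parameters would first be placed on a common scale (e.g., via anchor items) before the area is computed; that linking step is assumed, not shown, here.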