Experimental Studies of Inter-Rater Agreement in Traditional Chinese Medicine: A Systematic Review.

Eric Jacobson, Monica Shields, Peter M Wayne, Rosa Schnyer, Dolma Tsering, Lisa Conboy, Patrick Mcknight

doi:10.1089/acm.2019.0197

Abstract

Objectives: It has been recommended that clinical trials of Traditional Chinese Medicine (TCM) would be more ecologically valid if its characteristic mode of diagnostic reasoning were integrated into their design. In that context, however, it is also widely held that demonstrating a high level of agreement on initial TCM diagnoses is necessary for the replicability that the biomedical paradigm requires for the conclusions from such trials. Our aim was to review, summarize, and critique quantitative experimental studies of inter-rater agreement in TCM, and some of their underlying assumptions. Design: Systematic electronic searches were conducted for articles that reported a quantitative measure of inter-rater agreement across a number of rating choices based on examinations of human subjects in person by TCM practitioners, and that were published in English-language peer-reviewed journals. Publications in other languages or in venues other than peer-reviewed journals were excluded. Predefined categories of information were extracted from full texts by two investigators working independently. Each article was scored for methodological quality. Outcome measures: Design features were compared across all studies, and levels of inter-rater agreement were compared across studies that reported the same type of outcome statistic. Results: Twenty-one articles met inclusion criteria. Fourteen assessed inter-rater agreement on TCM diagnoses, two on diagnostic signs found upon traditional TCM examination, and five on novel rating schemes derived from TCM theory and practice. Raters were students of TCM colleges or graduates of TCM training programs with 3 or more years of experience and licensure. Type of outcome statistic varied. Mean rates of pairwise agreement averaged 57% (median 65%, range 19-96%) across the nine studies reporting them. Mean Cohen's kappa averaged 0.34 (median 0.34, range 0.07-0.59) across the seven studies reporting it. Meta-analysis was not possible due to variations in study design and outcome statistics. High risks of bias and confounding, and deficits in statistical reporting were common. Conclusions: With a few exceptions, the levels of agreement were low to moderate. Most studies had significant deficits in both methodology and reporting. Results overall suggest a few design features that might contribute to higher levels of agreement. These should be studied further with better experimental controls and more thorough reporting of outcomes. In addition, methods of complex systems analysis should be explored to more adequately model the relationship between clinical outcomes and the series of diagnoses and treatments that are the norm in actual TCM practice.
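The two outcome statistics named in the abstract, pairwise agreement and Cohen's kappa, can be illustrated with a minimal sketch. This is not the reviewed studies' analysis code: the function names, the category labels "A" and "B", and the ten ratings are hypothetical and serve only to show how raw agreement and chance-corrected agreement differ for two raters.

```python
from collections import Counter


def pairwise_agreement(ratings_1, ratings_2):
    """Proportion of subjects to whom two raters assigned the same category."""
    if len(ratings_1) != len(ratings_2) or not ratings_1:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    matches = sum(a == b for a, b in zip(ratings_1, ratings_2))
    return matches / len(ratings_1)


def cohens_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_1)
    p_o = pairwise_agreement(ratings_1, ratings_2)  # observed agreement
    counts_1 = Counter(ratings_1)
    counts_2 = Counter(ratings_2)
    # Agreement expected by chance, from each rater's marginal category frequencies.
    p_e = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings of ten subjects by two raters (illustrative only).
rater_1 = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
rater_2 = ["A", "A", "A", "B", "B", "B", "B", "B", "A", "A"]

print(f"pairwise agreement: {pairwise_agreement(rater_1, rater_2):.2f}")
print(f"Cohen's kappa:      {cohens_kappa(rater_1, rater_2):.2f}")
```

In this made-up example the raters agree on 6 of 10 subjects, yet with two equally used categories half of that agreement is expected by chance, so kappa is only about 0.2. This is one reason the two statistics are summarized separately rather than pooled.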

Full Text