Abstract

A variety of statistical procedures have been proposed for assessing the degree and significance of agreement between raters in assignments of objects or subjects to nominal scales. Foremost among available techniques is the coefficient κ devised by Cohen (1960) and later refined by Light (1971). This statistic essentially represents the normalized proportion of interrater agreement in excess of that expected on the basis of chance or random assignments. In recent years, a number of useful programs have been developed for computer applications of κ (Antonak, 1977; Berk & Campbell, 1976; Cicchetti, Lee, Fontana, & Dowds, 1978). However, such applications of the coefficient are bound by two important constraints. First, use of κ is appropriate only when testing agreement between two raters: It is not appropriate for answering questions about conjoint agreement among many raters. For this reason, Light (1971) developed an extension of the earlier statistic, known as κm, which is a coefficient of multiple-observer agreement. The second constraint on uses of κ is found in the assumption, underlying both Cohen's (1960) original statistic and Light's (1971) later extension, that the respective pairs or sets of raters assigning various objects to categories will remain identical across all cases. In applied research settings, it is frequently the case that neither assumption holds, because more than two independent raters may be involved in the categorical assignment of each case, and sets of raters may vary as a function of time or convenience (as, e.g., when subjects in different levels of a treatment or educational program must be observed and rated in different settings on repeated occasions). Fleiss (1971) developed formulas that revise and extend κ for use in situations where the number of observers may be greater than two and where there is no assumption that sets of raters will remain constant throughout all cases.
The resulting statistic may be viewed as a general or overall coefficient of agreement of many raters across all nominal categories. Fleiss also provided formulas to measure response agreement among many raters on each specific nominal category considered. This conditional coefficient is designed to test the probability that randomly chosen raters assign any randomly selected object or subject to an identical category. Such a conditional coefficient may be applied to evaluate the integrity or viability of any given categorical value or classification. The program described in this paper calculates both general and conditional coefficients and tests the statistical significance of agreement among many raters assigning objects to nominal scales, based upon Fleiss's (1971) computational formulas.

Input. Each analysis requires two control cards and a data card deck as follows: (1) a title card, (2) a problem card to specify the number of cases being categorized, the number of categories, and the number of raters, and (3) a set of case cards, one card per case, specifying the number of raters choosing each category.

Output. The information provided for each analysis includes (1) an alphanumeric job title, (2) the general percentage of agreement among raters before chance agreement is excluded, (3) the value of the general coefficient of agreement, (4) the estimated variance and standard error for the general coefficient, (5) the value of the unit normal deviate and level of significance for the general coefficient, (6) conditional percentages of agreement among raters for each category prior to the exclusion of chance, (7) values of the conditional coefficients for each category, (8) variances and standard errors for each conditional coefficient, and (9) unit normal deviates and significance levels for each conditional coefficient.

Computer and Language. Written in FORTRAN IV, the program is compatible with machines in the IBM 360 series. Variables are in mnemonic form according to Fleiss's (1971) computational formulas.
Input editing and output specifications are provided for the user's syntactical errors.

Restrictions. Currently, the program will permit up to 1,000 cases to be assigned by 100 or fewer raters to a maximum of 25 categories.

Availability. A source listing, user's manual, and test input and output data may be obtained at no cost by writing to Paul A. McDermott, University of Pennsylvania, Graduate School of Education CI, 3700 Walnut Street, Philadelphia, Pennsylvania 19104.
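The general and conditional coefficients described above can be sketched in modern code. The following is a minimal Python illustration of Fleiss's (1971) computational formulas, not the FORTRAN IV program itself; the function name and the list-of-lists input layout are assumptions, and the variance used for the unit normal deviate is Fleiss's approximate null-hypothesis variance for the general coefficient only.

```python
import math

def fleiss_coefficients(counts):
    """Fleiss's (1971) kappa from an N x k table `counts`, where
    counts[i][j] = number of raters assigning case i to category j.
    Assumes the same number of raters n rated every case."""
    N = len(counts)                      # number of cases
    k = len(counts[0])                   # number of categories
    n = sum(counts[0])                   # raters per case
    # Marginal proportion of all assignments falling in category j, p_j.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # Extent of agreement on case i, P_i.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                 # observed agreement before excluding chance
    P_e = sum(pj * pj for pj in p)       # agreement expected by chance
    kappa = (P_bar - P_e) / (1 - P_e)    # general coefficient
    # Conditional (per-category) coefficients, kappa_j.
    kappa_j = []
    for j in range(k):
        if 0 < p[j] < 1:
            disagreement = sum(row[j] * (n - row[j]) for row in counts)
            kappa_j.append(1 - disagreement / (N * n * (n - 1) * p[j] * (1 - p[j])))
        else:
            kappa_j.append(float("nan"))  # category never or always chosen
    # Approximate variance of the general coefficient under the null
    # hypothesis (Fleiss, 1971), giving a unit normal deviate z.
    var = (2 / (N * n * (n - 1))) * (
        P_e - (2 * n - 3) * P_e ** 2 + 2 * (n - 2) * sum(pj ** 3 for pj in p)
    ) / (1 - P_e) ** 2
    z = kappa / math.sqrt(var)
    return kappa, kappa_j, z
```

When every rater chooses the same category for each case, both the general and the conditional coefficients equal 1; the program additionally reports the raw percentages of agreement and the standard errors for each conditional coefficient.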
