Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence.
Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
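The chance correction that this abstract (and the kappa literature below) revolves around is easiest to see in code. The following is a minimal sketch of the classical Fleiss's kappa computation — the baseline that the proposed lambda coefficient modifies — not the paper's new prevalence-corrected coefficient, whose formula is not given in the abstract:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss's kappa from an (N subjects x k categories) matrix of
    rating counts; each row sums to the number of raters n
    (assumed constant across subjects)."""
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts.sum(axis=1)[0]           # raters per subject
    p_j = counts.sum(axis=0) / (N * n)  # category prevalences
    # per-subject observed agreement: fraction of agreeing rater pairs
    P_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                  # mean observed agreement
    P_e = np.sum(p_j**2)                # chance agreement under random assignment
    return (P_bar - P_e) / (1 - P_e)
```

Note how `P_e` depends only on the marginal prevalences `p_j` — exactly the term the paper argues conflates chance with prevalence-driven agreement.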
- Research Article
3
- 10.1002/sim.9694
- Mar 5, 2023
- Statistics in Medicine
Cohen's and Fleiss's kappa are popular estimators for assessing agreement among two and multiple raters, respectively, for a binary response. While additional methods have been developed to account for multiple raters and covariates, they are not always applicable, rarely used, and none simplify to Cohen's kappa. Furthermore, there are no methods to simulate Bernoulli observations under the kappa agreement structure such that the developed methods could be adequately assessed. This manuscript overcomes these shortfalls. First, we developed a model-based estimator for kappa that accommodates multiple raters and covariates through a generalized linear mixed model and encompasses Cohen's kappa as a special case. Second, we created a framework to simulate dependent Bernoulli observations that upholds the kappa agreement structure for every pair of raters and includes covariates. We used this framework to assess our method when kappa was nonzero. Simulations showed that Cohen's and Fleiss's kappa estimates were inflated, unlike our model-based kappa. We analyzed an Alzheimer's disease neuroimaging study and the classic cervical cancer pathology study. The proposed model-based kappa and the advancement in simulation methodology demonstrate that the popular Cohen's and Fleiss's kappa approaches are poised to yield invalid conclusions, while our work overcomes these shortfalls, leading to improved inferences.
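The abstract's simulation idea can be illustrated, for the simplest two-rater case, by constructing correlated Bernoulli pairs whose population Cohen's kappa equals a chosen target. This is a basic correlated-binary sketch, not the authors' covariate-bearing GLMM framework; the function name and parameters are illustrative:

```python
import numpy as np

def simulate_kappa_pairs(n, p, kappa, rng):
    """Draw n paired Bernoulli(p) ratings whose population Cohen's
    kappa equals `kappa`, using the correlated-binary construction
    P(1,1) = p^2 + kappa * p * (1 - p)."""
    p11 = p * p + kappa * p * (1 - p)
    p10 = p - p11                    # P(rater1=1, rater2=0)
    p01 = p - p11                    # P(rater1=0, rater2=1)
    p00 = 1 - p11 - p10 - p01
    cells = rng.choice(4, size=n, p=[p00, p01, p10, p11])
    r1 = (cells >= 2).astype(int)    # rater 1 scores 1 in cells 2, 3
    r2 = cells % 2                   # rater 2 scores 1 in cells 1, 3
    return r1, r2
```

Estimating Cohen's kappa from a large simulated sample recovers the target (up to sampling noise), which is the kind of check the paper's framework enables; the chosen `p` and `kappa` must keep all four cell probabilities nonnegative.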
- Research Article
1
- 10.1037/met0000732
- Aug 21, 2025
- Psychological methods
Cohen's kappa coefficient was introduced as a statistical measure to evaluate the degree of interrater agreement between two raters who classify each subject using categorical scales. Cohen posited that a certain level of agreement between raters is expected to occur by chance, and thus, kappa is designed to account for this expected chance agreement by adjusting the observed percent agreement. However, over time, several paradoxes and limitations have emerged in its interpretation, largely due to the underlying assumption of random chance agreement and its estimation. In this article, we propose that a portion of the observed percent agreement can be attributed to the interaction between category prevalence and the inherent characteristics of the categories themselves, such as their appeal, ambiguity, social desirability, or other factors related to the traits being measured. This prevalence-agreement effect can either positively or negatively influence the observed percent agreement. By moving away from the assumption of random assignment by raters, we derive a new coefficient of agreement that effectively removes the prevalence-agreement effect. We also discuss the significance of this new coefficient, its interpretation, and the stability of its estimation (standard error). (PsycInfo Database Record (c) 2025 APA, all rights reserved).
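The prevalence effect this abstract targets is concrete: two rater pairs can share the same observed percent agreement yet receive very different Cohen's kappas once one category dominates. A minimal sketch of the standard Cohen's kappa (not the proposed new coefficient) makes the paradox visible:

```python
def cohens_kappa(table):
    """Cohen's kappa from a 2x2 cross-classification table
    [[n00, n01], [n10, n11]] of two raters' binary labels."""
    n = sum(sum(row) for row in table)
    po = (table[0][0] + table[1][1]) / n            # observed agreement
    p1_a = (table[1][0] + table[1][1]) / n          # rater A's prevalence of '1'
    p1_b = (table[0][1] + table[1][1]) / n          # rater B's prevalence of '1'
    pe = p1_a * p1_b + (1 - p1_a) * (1 - p1_b)      # chance agreement
    return (po - pe) / (1 - pe)

# Both tables show 90% observed agreement, yet:
balanced = [[45, 5], [5, 45]]   # prevalences near 0.5 -> kappa = 0.80
skewed   = [[85, 5], [5, 5]]    # rare category        -> kappa ~ 0.44
```

With skewed prevalence the chance term `pe` is large, so the same 90% observed agreement is heavily discounted — the behaviour the proposed coefficient is designed to correct.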
- Research Article
1
- 10.1097/wno.0000000000001425
- Oct 22, 2021
- Journal of Neuro-Ophthalmology
Lid fatigability test (LFT), Cogan lid twitch (CLT), and forced eyelids closure test (FECT) are simple clinical screening tests for ocular myasthenia gravis (OMG). However, these tests are subjectively interpreted. We thus evaluated the interobserver and intra-observer reliability of each test. The 3 eyelid tests were performed in ptotic patients associated with various conditions, including OMG and others. Video clips of all tests were recorded using a smartphone's built-in camera in the following order: LFT, CLT, and FECT. All video clips were distributed to 3 neuro-ophthalmologists and 3 general ophthalmologists, who were trained to evaluate the tests using a single standard instruction. After 3 months, all video clips were re-organized for the second evaluation. Interobserver and intra-observer reliability were calculated using the Fleiss Kappa statistic and Cohen's Kappa coefficient, respectively. The 3 eyelid tests were performed and recorded in 35 patients, which included the diagnoses of OMG, levator muscle dehiscence, partial oculomotor nerve palsy, and Horner syndrome. CLT received moderate-to-substantial interobserver reliability in the neuro-ophthalmologist group (Fleiss Kappa 0.77 [95% CI 0.60-0.94] and 0.66 [95% CI 0.46-0.85] in the first and second evaluations, respectively), but the results varied in the general ophthalmologist group (Fleiss Kappa 0.58 [95% CI 0.37-0.79] and 0.54 [95% CI 0.33-0.76] in the first and second evaluations, respectively). FECT and LFT received lower interobserver reliability in both groups. CLT also received moderate-to-almost perfect intra-observer reliability in the neuro-ophthalmologist group (Cohen Kappa 0.55, 0.58, and 0.92), whereas FECT and LFT received lower intra-observer reliability. The intra-observer reliability varied among general ophthalmologists for all 3 eyelid tests. CLT is the most reliable test among the 3 eyelid tests. However, all tests should be interpreted with caution by general ophthalmologists.
- Abstract
- 10.1136/bjsports-2023-concussion.30
- Jan 1, 2024
- British Journal of Sports Medicine
Objective: To determine the reliability of the Wheelchair Error Scoring System (WESS) among clinicians from multiple disciplines. Design: Intra- and inter-rater reliability study. Setting: Gymnasium in the United States. Participants: Fifteen (M=11, F=4) wheelchair basketball athletes over age...
- Research Article
83
- 10.2106/jbjs.oa.18.00020
- Oct 23, 2018
- JBJS Open Access
Background: There is no standardized complication classification system that has been evaluated for use in pediatric or general orthopaedic surgery. Instead, subjective terms such as major and minor are commonly used. The Clavien-Dindo-Sink complication classification system has demonstrated high interrater and intrarater reliability for hip-preservation surgery and has increasingly been used within other orthopaedic subspecialties. This classification system is based on the magnitude of treatment required and the potential for each complication to result in long-term morbidity. The purpose of the current study was to modify the Clavien-Dindo-Sink system for application to all orthopaedic procedures (including those involving the spine and the upper and lower extremity) and to determine interrater and intrarater reliability of this modified system in pediatric orthopaedic surgery cases. Methods: The Clavien-Dindo-Sink complication classification system was modified for use with general orthopaedic procedures. Forty-five pediatric orthopaedic surgical scenarios were presented to 7 local fellowship-trained pediatric orthopaedic surgeons at 1 center to test internal reliability, and 48 scenarios were then presented to 15 pediatric orthopaedic surgeons across the United States and Canada to test external reliability. Surgeons were trained to use the system and graded the scenarios in a random order on 2 occasions. Fleiss and Cohen kappa (κ) statistics were used to determine interrater and intrarater reliabilities, respectively. Results: The Fleiss κ value for interrater reliability (and standard error) was 0.76 ± 0.01 (p < 0.0001) and 0.74 ± 0.01 (p < 0.0001) for the internal and external groups, respectively. For each grade, interrater reliability was good to excellent for both groups, with an overall range of 0.53 for Grade I to 1 for Grade V. The Cohen κ value for intrarater reliability was excellent for both groups, ranging from 0.83 (95% confidence interval [CI], 0.71 to 0.95) to 0.98 (95% CI, 0.94 to 1.00) for the internal test group and from 0.83 (95% CI, 0.73 to 0.93) to 0.99 (95% CI, 0.97 to 1.00) for the external test group. Conclusions: The modified Clavien-Dindo-Sink classification system has good interrater and excellent intrarater reliability for the evaluation of complications following pediatric orthopaedic upper extremity, lower extremity, and spine surgery. Adoption of this reproducible, reliable system as a standard of reporting complications in pediatric orthopaedic surgery, and other orthopaedic subspecialties, could be a valuable tool for improving surgical practices and patient outcomes.
- Research Article
- 10.3390/jcm14020575
- Jan 17, 2025
- Journal of clinical medicine
Objective: Literature regarding osteochondral lesions in patients following elbow dislocation is scarce. The aim of this study was to examine osteochondral lesions on MRI in patients following simple elbow dislocations and evaluate inter-rater reliability between radiologists and orthopedic surgeons at different levels of experience. Methods: In this retrospective, single-center study, 72 MRIs of patients following simple elbow dislocations were evaluated. Ligamentous and osteochondral injuries were evaluated by a junior and senior radiologist and a junior and senior orthopedic surgeon. Osteochondral lesions were classified according to the Anderson classification, and their distribution was assessed. Inter-rater reliability was assessed using Cohen's Kappa (95% CI) and Fleiss' Kappa (95% CI). Results: The mean time from injury to MRI was 6.92 ± 4.3 days, and the mean patient age was 42.4 ± 16.0 years. A total of 84.5% of patients had a lateral collateral ligament tear, and 69.0% had a medial collateral ligament tear. Osteochondral lesions were found in 27.8% to 63.9% of cases. According to the senior orthopedic surgeon, 100% were first-grade lesions, whereas the senior radiologist classified 63.2% as first-grade, 26.3% as second-grade, and 5.3% as third- and fourth-grade lesions. Inter-rater reliability was fair to moderate for ligamentous injuries and fair for osteochondral lesions (Fleiss Kappa 0.25 [0.15-0.34]). Localization of the lesions differed depending on the examiner. For all examiners, osteochondral lesions of the lateral column (radial head and capitulum) were most common, with 57.8-66.7% of all lesions. Inter-rater reliability was moderate for lesions in the medial column (Fleiss Kappa 0.51 [0.41-0.6]) and fair for lesions in the lateral column (Fleiss Kappa 0.34 [0.24-0.43]). Conclusions: Osteochondral lesions following simple elbow dislocations are common; however, in contrast to the current literature, high-grade lesions seem to be relatively rare. 
Overall inter-rater reliability between radiologists and surgeons, as well as within surgeons, was only moderate to fair regarding ligament and osteochondral lesions.
- Discussion
- 10.1111/jth.14538
- Aug 1, 2019
- Journal of Thrombosis and Haemostasis
Method agreement analysis and interobserver reliability of the ISTH proposed definitions for effective hemostasis in the management of major bleeding: Methodological issues
- Research Article
- 10.1111/codi.70025
- Mar 1, 2025
- Colorectal disease : the official journal of the Association of Coloproctology of Great Britain and Ireland
Complete mesocolic excision (CME) for colon cancer has been associated with improved oncological outcomes but requires a detailed understanding of complex mesenteric vasculature. Three-dimensional (3D) reconstructed models derived from patient imaging could enhance preoperative anatomical comprehension, enabling safer, precision CME. In this two-phase, blinded, crossover study, four expert CME surgeons evaluated mesenteric vascular anatomy on CT scans and 3D models. In phase 1, surgeons assessed 66 cases, while 20 were re-evaluated in phase 2. The primary outcome measure was inter-rater reliability by Fleiss's kappa. Secondary outcomes were intra-rater reliability by Cohen's kappa and anatomical accuracy rates measured as a percentage of correct responses on a standardised questionnaire. In phase 1, inter-rater agreement was higher for 3D models (average kappa 0.6, moderate agreement) than for CT scans (average kappa 0.1, poor agreement). Ileocolic vein drainage and ileocolic artery trajectory showed the highest kappa values with 3D imaging (0.85 and 0.93, respectively). Accuracy was also superior with 3D across all surgeons (mean 89.7% correct) versus CT (mean 79.1% correct, P < 0.001). In phase 2, intra-rater reliability remained higher for 3D (average Cohen's kappa 0.61) than CT scans (Cohen's kappa 0.27). 3D mesenteric models significantly improve inter- and intra-rater reliability among CME experts over traditional CT scans while markedly enhancing anatomical comprehension accuracy about critical right-sided colonic vasculature. 3D planning could facilitate CME by enabling superior preoperative visualisation of these vessels.
- Research Article
2
- 10.1097/bpo.0000000000001130
- Apr 1, 2018
- Journal of Pediatric Orthopaedics
Growth-friendly surgery has high complication rates. The Complication Severity Score for growth-friendly surgery was developed to maintain consistency while reporting complications as part of research in this rapidly evolving field. This study evaluates the interrater and intrarater reliability of this complication classification system. After Institutional Review Board approval, complications during treatment for early onset scoliosis were identified from a prospectively collected database. Previous validation studies and a 10-case pilot survey determined that 60 cases were needed to represent a minimum of substantial agreement. In total, 63 of 496 cases were selected randomly to evenly represent each severity classification. The cases formed an internet classification survey sent twice, 3 weeks apart, to faculty and research coordinators involved in early onset scoliosis research, with questions shuffled between iterations. Fleiss Kappa and Cohen Kappa were used to assess interrater and intrarater agreement, respectively. A total of 20 participants, 12 faculty and 8 research assistants, completed the survey twice. The overall Fleiss Kappa coefficient for interrater agreement from the second round of the survey was 0.86 (95% confidence interval, 0.86-0.87), which represents substantial agreement. Reviewers agreed almost perfectly on categorizing complications as Device I (0.85), Disease I (0.91), Disease II (0.94), Device IIB (0.92), and Disease IV (0.98). There was substantial agreement for categorizing Device IIA (0.73) and Device III (0.76) complications. Disease III and Device IV were not evaluated in this survey since none of these occurred in the database. There was almost perfect intrarater agreement among faculty (0.87), research coordinators (0.85), and overall (0.86). There is strong interrater and intrarater agreement for the published complications classification scheme for growing spine surgery.
The complication classification system is a reliable tool for standardizing reports of complications with growth-friendly surgery. Adoption of this classification when reporting on growth-friendly surgery is recommended to allow for comparison of complications between treatment modalities. Level I-diagnostic study.
- Research Article
20
- 10.1177/10711007211058154
- Dec 1, 2021
- Foot & Ankle International
Historical concept of flatfoot as posterior tibial tendon dysfunction (PTTD) has been questioned. Recently, the consensus group published a new classification system and recommended renaming PTTD to Progressive Collapsing Foot Deformity (PCFD). The new PCFD classification could be effective in providing comprehensive information on the deformity. To date, there has been no study reporting intra- and interobserver reliability and the frequency of each class in the PCFD classification. This was a single-center, retrospective study conducted from prospectively collected registry data. A consecutive cohort of PCFD patients evaluated from February 2015 to October 2020 was included, consisting of 92 feet in 84 patients. Classification of each patient was made using characteristic clinical and radiographic findings by 3 independent observers. Frequencies of each class and subclass were assessed. Intraobserver and interobserver reliabilities were analyzed with Cohen kappa and Fleiss kappa, respectively. Mean sample age was 54.4 years; 38% were male and 62% female. 1ABC (25.4%) was the most common subclass, followed by 1AC (8.7%) and 1ABCD (6.9%). Only a small percentage of patients had isolated deformity. Class A was the most frequent component (89.5%), followed by C in 86.2% of the cases. Moderate interobserver reliability (Fleiss kappa = 0.561, P < .001, 95% CI 0.528-0.594) was found for overall classification. Very good intraobserver reliability was found (Cohen kappa = 0.851, P < .001, 95% CI 0.777-0.926). Almost half (49.3%) of patients had a presentation dominantly involving the hindfoot (A) with various combinations of midfoot and/or forefoot deformity (B), (C) with or without subtalar joint involvement (D). The new system may cover all possible combinations of the PCFD, providing a comprehensive description and guiding treatment in a systematic and individualized manner, but this initial study suggests an opportunity to improve overall interobserver reliability.
Level III, retrospective diagnostic study.
- Research Article
3
- 10.1016/j.neurol.2021.08.007
- Nov 14, 2021
- Revue Neurologique
Intra- and inter-rater consistency of dual assessment by radiologist and neurologist for evaluating DWI-ASPECTS in ischemic stroke
- Research Article
6
- 10.20319/lijhls.2017.33.115
- Nov 16, 2017
- LIFE: International Journal of Health and Life-Sciences
The assessment of consistency in the categorical or ordinal decisions made by observers or raters is an important problem especially in the medical field. The Fleiss Kappa, Cohen Kappa and Intra-class Correlation (ICC), as commonly used for this purpose, are compared and a generalised approach to these measurements is presented. Differences between the Fleiss Kappa and multi-rater versions of the Cohen Kappa are explained and it is shown how both may be applied to ordinal scoring with linear, quadratic or other weighting. The relationship between quadratically weighted Fleiss and Cohen Kappa and pair-wise ICC is clarified and generalised to multi-rater assessments. The AC coefficient is considered as an alternative measure of consistency and the relevance of the Kappas and AC to measuring content validity is explored.
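The linear and quadratic weighting discussed here can be sketched for the two-rater (Cohen) case as follows; the multi-rater generalisation and the ICC equivalence examined in the paper are not reproduced, and the function name and default are illustrative:

```python
import numpy as np

def weighted_kappa(conf, weights="quadratic"):
    """Weighted Cohen's kappa from a k x k confusion matrix of two
    raters' ordinal scores. Disagreement weights grow with category
    distance: linearly, or quadratically (the variant whose multi-rater
    form relates to the pairwise ICC)."""
    conf = np.asarray(conf, dtype=float)
    k = conf.shape[0]
    i, j = np.indices((k, k))
    d = np.abs(i - j) / (k - 1)                 # normalised category distance
    w = d**2 if weights == "quadratic" else d   # disagreement weights
    p = conf / conf.sum()                       # joint proportions
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))  # chance joint proportions
    return 1 - (w * p).sum() / (w * expected).sum()
```

With only two categories the linear and quadratic weights coincide, and both reduce to ordinary (unweighted) Cohen's kappa; differences appear from three ordinal categories upward.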
- Research Article
- 10.1016/j.jvoice.2025.07.031
- Aug 1, 2025
- Journal of voice : official journal of the Voice Foundation
Reliability Study of GRBAS and CAPE-V Based on Large Samples in Chinese Context.
- Research Article
75
- 10.1111/bjd.16327
- Apr 19, 2018
- British Journal of Dermatology
Incontinence-associated dermatitis (IAD) is a specific type of irritant contact dermatitis with different severity levels. An internationally accepted instrument to assess the severity of IAD in adults, with established diagnostic accuracy, agreement and reliability, is needed to support clinical practice and research. To design the Ghent Global IAD Categorization Tool (GLOBIAD) and evaluate its psychometric properties. The design was based on expert consultation using a three-round Delphi procedure with 34 experts from 13 countries. The instrument was tested using IAD photographs, which reflected different severity levels, in a sample of 823 healthcare professionals from 30 countries. Measures for diagnostic accuracy (sensitivity and specificity), agreement, interrater reliability (multirater Fleiss kappa) and intrarater reliability (Cohen's kappa) were assessed. The GLOBIAD consists of two categories based on the presence of persistent redness (category 1) and skin loss (category 2), both of which are subdivided based on the presence of clinical signs of infection. The agreement for differentiating between category 1 and category 2 was 0.86 [95% confidence interval (CI) 0.86-0.87], with a sensitivity of 90% and a specificity of 84%. The overall agreement was 0.55 (95% CI 0.55-0.56). The Fleiss kappa for differentiating between category 1 and category 2 was 0.65 (95% CI 0.65-0.65). The overall Fleiss kappa was 0.41 (95% CI 0.41-0.41). The Cohen's kappa for differentiating between category 1 and category 2 was 0.76 (95% CI 0.75-0.77). The overall Cohen's kappa was 0.61 (95% CI 0.59-0.62). The development of the GLOBIAD is a major step towards a better systematic assessment of IAD in clinical practice and research worldwide. However, further validation is needed.
- Research Article
18
- 10.1371/journal.pone.0179092
- Jul 13, 2017
- PLoS ONE
Scoring reflex responsiveness and injury of aquatic organisms has gained popularity as a predictor of discard survival. Because this method relies on individual interpretation of scoring criteria, its robustness is evaluated here by testing whether multiple protocol-instructed raters with diverse backgrounds (research scientist, technician, and student) can produce similar or identical reflex and injury scores for the same flatfish (European plaice, Pleuronectes platessa) after it experiences commercial fishing stressors. Inter-rater reliability for three raters was assessed using a 3-point categorical scale (‘absent’, ‘weak’, ‘strong’) and a tagged visual analogue continuous scale (tVAS, a 10 cm bar with 0 for ‘absent’ and three labelled sections: ‘weak’, ‘moderate’, and ‘strong’) for six reflex responses, and a 4-point scale for four injury types. Plaice (n = 304) were sampled from 17 research beam-trawl deployments during four trips. Fleiss kappa (categorical scores) and intra-class correlation coefficients (ICC, continuous scores) indicated variable inter-rater agreement by reflex type (ranging between 0.55 and 0.88, and 67% and 91% for Fleiss kappa and ICC, respectively), with least agreement among raters on extent of injury (Fleiss kappa between 0.08 and 0.27). Despite differences among raters, which did not significantly influence the relationship between impairment and predicted survival, combining categorical reflex and injury scores always produced a close relationship between such vitality indices and observed delayed mortality. The use of the continuous scale did not improve the fit of these models compared with using the reflex impairment index based on categorical scores. Given these findings, we recommend using a 3-point categorical over a continuous scale. We also determined that training, rather than experience, of raters minimised inter-rater differences.
Our results suggest that cost-efficient reflex impairment and injury scoring may be considered a robust technique to evaluate lethal stress and damage of this flatfish species on-board commercial beam-trawl vessels.