Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence.

Abstract

Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
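The abstract does not give the formula for weighted Lambda, so it is not reproduced here. As a point of reference for the kappa-type coefficients the abstract contrasts it with, the following is a minimal sketch of Cohen's weighted kappa for two raters on an ordinal scale. The function name `weighted_kappa`, the `scheme` argument, and the example table are illustrative assumptions, not taken from the article.

```python
def weighted_kappa(table, scheme="linear"):
    """Cohen's weighted kappa for a square k x k table of counts.

    table[i][j] is the number of subjects that rater A placed in
    category i and rater B placed in category j, with categories
    listed in their ordinal order. This is the standard kappa-type
    coefficient, NOT the weighted Lambda proposed in the article.
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least 2 categories")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one rating")

    def weight(i, j):
        # Agreement weight: 1 on the diagonal, shrinking with distance.
        d = abs(i - j) / (k - 1)
        if scheme == "identity":   # unweighted kappa
            return 1.0 if i == j else 0.0
        if scheme == "linear":
            return 1.0 - d
        if scheme == "quadratic":
            return 1.0 - d * d
        raise ValueError(f"unknown scheme: {scheme}")

    # Marginal proportions of each rater.
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    cells = [(i, j) for i in range(k) for j in range(k)]
    # Weighted observed agreement.
    p_obs = sum(weight(i, j) * table[i][j] / n for i, j in cells)
    # Weighted agreement expected if the raters were independent.
    p_exp = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical 3-category table: disagreements are all one step apart.
example = [[20, 5, 0],
           [5, 20, 5],
           [0, 5, 20]]
for s in ("identity", "linear", "quadratic"):
    print(s, round(weighted_kappa(example, s), 3))
```

For this hypothetical table the three schemes give roughly 0.624, 0.709, and 0.800: the more credit the weights give to near-misses, the higher the coefficient. The expected-agreement term depends only on the raters' marginal distributions, which is the dependence the abstract identifies as a weakness of kappa-type measures.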

Similar Papers
  • Front Matter
  • Cite Count: 24
  • 10.1097/00000542-200203000-00002
Measurement of pain in children: state-of-the-art considerations.
  • Mar 1, 2002
  • Anesthesiology
  • Brenda C Mcclain

  • Research Article
  • 10.4236/ojs.2024.145021
Interrater Reliability Estimation via Maximum Likelihood for Gwet's Chance Agreement Model.
  • Jan 1, 2024
  • Open journal of statistics
  • Alek M Westover + 2 more

Interrater reliability (IRR) statistics, like Cohen's kappa, measure agreement between raters beyond what is expected by chance when classifying items into categories. While Cohen's kappa has been widely used, it has several limitations, prompting development of Gwet's agreement statistic, an alternative "kappa" statistic which models chance agreement via an "occasional guessing" model. However, we show that Gwet's formula for estimating the proportion of agreement due to chance is itself biased for intermediate levels of agreement, despite overcoming limitations of Cohen's kappa at high and low agreement levels. We derive a maximum likelihood estimator for the occasional guessing model that yields an unbiased estimator of the IRR, which we call the maximum likelihood kappa. The key result is that the chance agreement probability under the occasional guessing model is simply equal to the observed rate of disagreement between raters. The statistic provides a theoretically principled approach to quantifying IRR that addresses limitations of previous coefficients. Given the widespread use of IRR measures, having an unbiased estimator is important for reliable inference across domains where rater judgments are analyzed.

  • Research Article
  • Cite Count: 1
  • 10.1037/met0000732
Coefficient of agreement between two raters corrected for category prevalence: Alternative to kappa.
  • Aug 21, 2025
  • Psychological methods
  • Rashid Saif Almehrizi

Cohen's kappa coefficient was introduced as a statistical measure to evaluate the degree of interrater agreement between two raters who classify each subject using categorical scales. Cohen posited that a certain level of agreement between raters is expected to occur by chance, and thus, kappa is designed to account for this expected chance agreement by adjusting the observed percent agreement. However, over time, several paradoxes and limitations have emerged in its interpretation, largely due to the underlying assumption of random chance agreement and its estimation. In this article, we propose that a portion of the observed percent agreement can be attributed to the interaction between category prevalence and the inherent characteristics of the categories themselves, such as their appeal, ambiguity, social desirability, or other factors related to the traits being measured. This prevalence-agreement effect can either positively or negatively influence the observed percent agreement. By moving away from the assumption of random assignment by raters, we derive a new coefficient of agreement that effectively removes the prevalence-agreement effect. We also discuss the significance of this new coefficient, its interpretation, and the stability of its estimation (standard error). (PsycInfo Database Record (c) 2025 APA, all rights reserved).

  • Research Article
  • Cite Count: 1
  • 10.1111/anzs.12421
Bayesian analysis of multivariate mixed longitudinal ordinal and continuous data.
  • Aug 13, 2024
  • Australian & New Zealand journal of statistics
  • Xiao Zhang

Multivariate longitudinal ordinal and continuous data exist in many scientific fields. However, it is a rigorous task to jointly analyse them due to the complicated correlated structures of those mixed data and the lack of a multivariate distribution. The multivariate probit model, assuming there is a multivariate normal latent variable for each multivariate ordinal data, becomes a natural modeling choice for longitudinal ordinal data especially for jointly analysing with longitudinal continuous data. However, the identifiable multivariate probit model requires the variances of the latent normal variables to be fixed at 1, thus the joint covariance matrix of the latent variables and the continuous multivariate normal variables is restricted at some of the diagonal elements. This constrains to develop both the classical and Bayesian methods to analyse mixed ordinal and continuous data. In this investigation, we proposed three Markov chain Monte Carlo (MCMC) methods: Metropolis--Hastings within Gibbs algorithm based on the identifiable model, and a Gibbs sampling algorithm and parameter-expanded data augmentation based on the constructed non-identifiable model. Through simulation studies and a real data application, we illustrated the performance of these three methods and provided an observation of using non-identifiable model to develop MCMC sampling methods.

  • Research Article
  • Cite Count: 3
  • 10.1016/j.jvoice.2021.09.010
An Open Access Standardised Voice Evaluation Protocol
  • Oct 22, 2021
  • Journal of Voice
  • Luis M.T Jesus + 3 more

  • Research Article
  • Cite Count: 77
  • 10.1177/00220345860650031201
Examiner reliability in dental radiography.
  • Mar 1, 1986
  • Journal of Dental Research
  • R.W Valachovic + 4 more

In long-term investigations involving a large number of study participants, it is frequently necessary to employ the use of multiple examiners who must exhibit high levels of inter- and intra-examiner reliability in order to minimize examiner bias, which can distort scientific findings. This report on the calibration of four examiners in a large project investigating the efficacy of dental radiography shows high levels of examiner reliability using various statistical measures of agreement. Levels of intra-examiner agreement using Cohen's Kappa index were 0.75 and higher at baseline, and remained at approximately the same level (0.80) throughout the 24-month period of the study. The Kappa index of inter-examiner agreement among the six pairings of the four examiners ranged from 0.68 to 0.80 for caries and 0.72 and 0.83 for periodontal disease. Values of four statistical measures of agreement (proportional agreement, Kendall's rank correlation, Cohen's Kappa index, and Cohen's weighted Kappa index) were determined to show the importance of using measures, such as the Kappa index, which take chance agreement into account.

  • Research Article
  • Cite Count: 132
  • 10.1007/s10995-008-0384-7
Accuracy of body mass index categories based on self-reported height and weight among women in the United States.
  • Jul 8, 2008
  • Maternal and child health journal
  • Benjamin M Craig + 1 more

The purpose of this study was to assess the accuracy of BMI categories based on self-reported height and weight in adult women. BMI categories from self-reported responses were compared to categories measured during physical examination from women, age 18 or older, who participated in the National Health and Examination Survey, 1999-2004. We first examined strength of agreement using Cohen's kappa, which, unlike sensitivity and specificity, allows for the comparison of polychotomous measures beyond chance agreement. Kappa regression identifies potential threats to accuracy. Likelihood of bias, as measured by under-reporting, was examined using logistic regression. Cohen's kappa estimates were 0.443 for pregnant women (N = 724) and 0.705 for non-pregnant women (N = 5,910). Kappa varied by age and race, but was largely unrelated to socioeconomic status, health and health behaviors. Women who visited a physician in the last year or been diagnosed with osteoporosis were more accurate, while women most likely to under-report were older, white, non-Hispanic, and college-educated. Our results suggest substantial agreement between self-reported and measured categories, except for women who are pregnant, above the age of 75 or without physician visits. Under-reporting may be more prevalent in well-educated, white populations than minority populations.

  • Research Article
  • Cite Count: 110
  • 10.1207/s15328031us0203_03
Interrater Agreement Measures: Comments on Kappan, Cohen's Kappa, Scott's π, and Aickin's α
  • Aug 1, 2003
  • Understanding Statistics
  • Louis M Hsu + 1 more

The Cohen (1960) kappa interrater agreement coefficient has been criticized for penalizing raters (e.g., diagnosticians) for their a priori agreement about the base rates of categories (e.g., base rates of disorders). A modification of kappa, called kappan (alias S coefficient, C coefficient, G index, and RE coefficient) has been proposed as an alternative to Cohen's kappa: Kappan was intended to reward rather than penalize classification agreements attributable to interrater agreement about base rates. In this article, we show that kappan has some serious limitations: It can be large when raters who randomly assign objects (e.g., patients) to categories (diagnoses) radically disagree about base rates, and it can be much larger when these raters have very different beliefs about base rates than when they are in complete agreement about base rates. Contrary to the views of recent critics of Cohen's kappa, we argue that Cohen's kappa (which does not have these serious limitations) is generally preferable to...

  • Research Article
  • Cite Count: 2
  • 10.1016/j.ostima.2023.100164
Reliability of ultrasound-detected effusion-synovitis in knee osteoarthritis
  • Aug 11, 2023
  • Osteoarthritis Imaging
  • Lindsey A Macfarlane + 5 more

  • Research Article
  • Cite Count: 15
  • 10.1093/ptj/pzab150
The Conundrum of Kappa and Why Some Musculoskeletal Tests Appear Unreliable Despite High Agreement: A Comparison of Cohen Kappa and Gwet AC to Assess Observer Agreement When Using Nominal and Ordinal Data.
  • Jun 15, 2021
  • Physical Therapy
  • Michael T Cibulka + 1 more

In clinical practice, physical therapists often use different kinds of tests and measures in the assessment of their patients. For therapists to have confidence when using their tests and measures, an important attribute is having intratester and intertester reliability. Studies that assess reliability are cases of observer agreement. Many studies have been performed assessing observer agreement in the physical therapy literature. The most commonly used method to assess observer agreement studies that use nominal or ordinal data is the statistical method suggested by Cohen and the corresponding reliability coefficient, Cohen kappa. Recently, Cohen kappa has undergone scrutiny because of what is called kappa paradox, which occurs when observer agreement is high but the resulting kappa value is low. Another paradox also occurs when asymmetries exist between raters on their disagreements, resulting in a higher kappa value. In the physical therapy literature, there are numerous examples of this problem, which can often lead to misunderstanding the meaning of the data. This Perspective examines how and why these problems occur and suggests an alternative method for assessing observer agreement.
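The kappa paradox described in this abstract can be reproduced with a small calculation. The sketch below assumes a hypothetical 2 x 2 table in which one category is highly prevalent, and compares raw percent agreement, Cohen's kappa, and Gwet's AC1. The function name `agreement_2x2` and the counts are illustrative assumptions, not data from the cited paper.

```python
def agreement_2x2(table):
    """Percent agreement, Cohen's kappa, and Gwet's AC1 for a 2 x 2 table.

    table = [[a, b], [c, d]], where rows are rater A's categories and
    columns are rater B's categories (counts of subjects).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    p_obs = (a + d) / n                      # observed percent agreement

    row1 = (a + b) / n                       # rater A: share in category 1
    col1 = (a + c) / n                       # rater B: share in category 1

    # Cohen: chance agreement from the product of the raters' marginals.
    pe_kappa = row1 * col1 + (1 - row1) * (1 - col1)
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)

    # Gwet AC1 (two categories): chance agreement 2 * pi * (1 - pi),
    # where pi is the average of the two raters' category-1 shares.
    pi1 = (row1 + col1) / 2
    pe_ac1 = 2 * pi1 * (1 - pi1)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)

    return p_obs, kappa, ac1


# Hypothetical skewed table: 90% raw agreement, rare second category.
p_obs, kappa, ac1 = agreement_2x2([[90, 5], [5, 0]])
print(f"agreement={p_obs:.3f}  kappa={kappa:.3f}  AC1={ac1:.3f}")
```

With these assumed counts the raters agree on 90% of subjects, yet kappa is slightly negative (about -0.05) because the skewed marginals push Cohen's chance-agreement term to 0.905, while AC1 stays near 0.89. The table has no undefined cases here; a table with all ratings in a single cell would make the kappa denominator zero and is not handled by this sketch.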

  • Research Article
  • 10.23736/s2724-5276.22.06462-x
Surgical validation of functional magnetic resonance urography in the study of ureteral anomalies distal to the uretero-pelvic junction in a pediatric cohort.
  • Jan 1, 2022
  • Minerva pediatrics
  • Fiammetta Sertorio + 10 more

Ureteral anomalies distal to the uretero-pelvic junction (UPJ) belong to the wide spectrum of congenital anomalies of the kidney and urinary tract (CAKUT). They can cause severe obstruction requiring a detailed anatomical depiction to define the surgical approach. Up to date, ultrasonography, voiding cystourethrography and scintigraphy are considered the gold-standard diagnostic tools to study obstructive anomalies of the urinary tract; however, they do not provide accurate ureteral anatomical details. The aim of our study was to evaluate the concordance between functional magnetic resonance urography (fMRU) and intraoperative findings to define ureteral anomalies distal to UPJ. Pediatric patients with ureteral anomalies distal to the UPJ who underwent surgery after performing fMRU were retrospectively collected. Surgical data were compared with radiological results. The concordance was assessed considering both pathological and non-pathological urinary tracts and was calculated by means of the Cohen's kappa coefficient. fMRU diagnostic accuracy was defined by sensitivity, specificity, and binomial exact confidence intervals. We included 46 patients. The sensitivity and specificity of fMRU were 98.0% and 83.3%; positive predictive value 90.4%, negative predictive value 96.2%. The concordance between surgical findings and fMRU was 92.3%, with a Cohen's k coefficient of 0.83 (excellent). Our study demonstrates the excellent agreement between fMRU and surgical findings in the definition of ureteral anomalies distal to the UPJ in children. Thus, it could be considered a valid imaging technique in the preoperative planning as it provides the surgeon with important information regarding the etiology and site of the obstruction.

  • Research Article
  • Cite Count: 1
  • 10.7759/cureus.57441
COVID-19-Associated Rhino-Orbito-Cerebral Mucormycosis: A Single Tertiary Care Center Experience of Imaging Findings With a Special Focus on Intracranial Manifestations and Pathways of Intracranial Spread.
  • Apr 2, 2024
  • Cureus
  • Megha G Nair + 2 more

Background and objective: The COVID-19 pandemic and mucormycosis epidemic in India made research on the radiological findings of COVID-19-associated mucormycosis imperative. This study aims to describe the imaging findings in COVID-19-associated mucormycosis, with a special focus on the intracranial manifestations. Methodology: Magnetic resonance imaging (MRI) scans of all patients with laboratory-proven mucormycosis and post-COVID-19 status, for two months, at an Indian Tertiary Care Referral Centre, were retrospectively reviewed, and descriptive statistical analysis was carried out. Results: A total of 58 patients (47 men, 81%, and 11 women, 19%) were evaluated. Deranged blood glucose levels were observed in 47 (81%) cases. The intracranial invasion was detected in 31 (53.4%) patients. The most common finding in cases with intracranial invasion was pachymeningeal enhancement (28/31, 90.3%). This was followed by infarcts (17/31, 55%), cavernous sinus thrombosis (11/58, 18.9%), fungal abscesses (11/31, 35.4%), and intracranial hemorrhage (5/31, 16.1% cases). The perineural spread was observed in 21.6% (11/51) cases. Orbital findings included extraconal fat and muscle involvement, intraconal involvement, orbital apicitis, optic neuritis, panophthalmitis, and orbital abscess formation in decreasing order of frequency. Cohen's kappa coefficient of interrater reliability for optic nerve involvement and cavernous sinus thrombosis was 0.7. Cohen's coefficient for all other findings was 0.8-0.9. Conclusions: COVID-19-associated rhino-orbito-cerebral mucormycosis has a plethora of orbital and intracranial manifestations. MRI, with its superior soft-tissue resolution and high interrater reliability, as elucidated in this study, is the imaging modality of choice for expediting the initial diagnosis, accurately mapping out disease extent, and promptly identifying and scrupulously managing its complications.

  • Research Article
  • Cite Count: 19
  • 10.1007/s11135-012-9807-z
Underlying determinants driving agreement among coders
  • Dec 17, 2012
  • Quality & Quantity
  • Guangchao Charles Feng

There are plenty of intercoder reliability indices, whereas the choice of them has been debated. With a Monte Carlo simulation, the determinants of the agreement indices were empirically tested. The chance agreement of Bennett’s S is found to be only affected by the number of categories. Consequently, S is a category based index. The chance agreements of Krippendorff’s \(\alpha \), Scott’s \(\pi \) and Cohen’s \(\kappa \) are affected by the marginal distribution, the level of difficulty and the interaction between them, and yet the difficulty level influences their chance agreements abnormally. The three indices are hence in general distribution based indices. Gwet’s \(AC_1\) reversed the direction of the three aforementioned indices, but its chance agreement is additionally affected by the number of categories and the interaction between the number of categories and the marginal distribution. \(AC_1\) can be classified into a class based on the number of categories, the marginal distribution and the level of difficulty. Both theoretical and practical implications were also discussed in the end.

  • Research Article
  • Cite Count: 32
  • 10.1371/journal.pone.0061401
Substantial Agreement of Referee Recommendations at a General Medical Journal – A Peer Review Evaluation at Deutsches Ärzteblatt International
  • May 2, 2013
  • PLoS ONE
  • Christopher Baethge + 2 more

Background: Peer review is the mainstay of editorial decision making for medical journals. There is a dearth of evaluations of journal peer review with regard to reliability and validity, particularly in the light of the wide variety of medical journals. Studies carried out so far indicate low agreement among reviewers. We present an analysis of the peer review process at a general medical journal, Deutsches Ärzteblatt International. Methodology/Principal Findings: 554 reviewer recommendations on 206 manuscripts submitted between 7/2008 and 12/2009 were analyzed: 7% recommended acceptance, 74% revision and 19% rejection. Concerning acceptance (with or without revision) versus rejection, there was a substantial agreement among reviewers (74.3% of pairs of recommendations) that was not reflected by Fleiss' or Cohen's kappa (<0.2). The agreement rate amounted to 84% for acceptance, but was only 31% for rejection. An alternative kappa-statistic, however, Gwet's kappa (AC1), indicated substantial agreement (0.63). Concordance between reviewer recommendation and editorial decision was almost perfect when reviewer recommendations were unanimous. The correlation of reviewer recommendations and citations as counted by Web of Science was low (partial correlation adjusted for year of publication: −0.03, n.s.). Conclusions/Significance: Although our figures are similar to those reported in the literature our conclusion differs from the widely held view that reviewer agreement is low: Based on overall agreement we consider the concordance among reviewers sufficient for the purposes of editorial decision making. We believe that various measures, such as positive and negative agreement or alternative Kappa values are superior to the application of Cohen's or Fleiss' Kappa in the analysis of nominal or ordinal level data regarding reviewer agreement. Also, reviewer recommendations seem to be a poor proxy for citations because, for example, manuscripts will be changed considerably during the revision process.

  • Research Article
  • Cite Count: 1
  • 10.7498/aps.65.077702
Numerical extraction of electric field distribution from thermal pulse method based on Monte Carlo simulation
  • Jan 1, 2016
  • Acta Physica Sinica
  • Liang Ming-Hui + 3 more

Thermal-pulse method is a powerful tool for measuring space charge distributions in polymer films. The data analysis for thermal-pulse method involves the Fredholm integral equation of the first kind, which requires an appropriate numerical procedure to obtain a solution. Various numerical techniques, including scale transformation and regulation method, are proposed. Of those numerical methods, the scale transformation (ST) is the simplest and the most widely used method. However, it presents a high spatial resolution only near the sample surface. Monte Carlo (MC) method is one of the recently proposed ways to solve the equation numerically and has been successfully applied to the analysis of laser intensity modulation method data, which also involves the Fredholm integral equation of the first kind. In this paper we attempt to analyze thermal-pulse data in frequency domain with the MC method and discuss its effectiveness based on some numerical simulations. The simulation results indicate that the electric field profiles can be effectively extracted by the MC method. The computed profiles by the MC method consist well with the supposed distributions in the entire thickness of the sample, while the profiles reconstructed by the ST method fit very well to the supposed one at the vicinity of the target surface and distort sharply along the direction of the thermal pulse propagation in the sample bulk. On the other hand, the oscillations in the computed results by the MC method could deteriorate its accuracy in this study. The influence of noise level on the analysis based on the MC method is also tested by the use of the simulated data. The results show that the computed profiles would become more fluctuant as the noise level increases. This problem can be solved by selecting a larger value of tolerance during the singular value decomposition procedure. 
Thus, the value of tolerance is considered to be one of the key parameters in this algorithm, which is actually hard to determine. Additionally, the experimental data obtained from a polypropylene film under applied electric field are analyzed to illustrate the feasibility of MC method to be applied to the thermal-pulse experimental data. The results also show that the spatial accuracy by the MC method in the entire sample thickness is higher than by the ST method, which verifies that the MC method is more suitable for detecting the electric field distribution in the deep bulk of the sample. Owing to noise and error, the accuracy of MC calculation depends on the chosen tolerance value, which is now considered to be an obstacle in applying this method to the practical thermal-pulse measurement.
