Coefficient of agreement between two raters corrected for category prevalence: Alternative to kappa.

Abstract

Cohen's kappa coefficient was introduced as a statistical measure to evaluate the degree of interrater agreement between two raters who classify each subject using categorical scales. Cohen posited that a certain level of agreement between raters is expected to occur by chance, and thus, kappa is designed to account for this expected chance agreement by adjusting the observed percent agreement. However, over time, several paradoxes and limitations have emerged in its interpretation, largely due to the underlying assumption of random chance agreement and its estimation. In this article, we propose that a portion of the observed percent agreement can be attributed to the interaction between category prevalence and the inherent characteristics of the categories themselves, such as their appeal, ambiguity, social desirability, or other factors related to the traits being measured. This prevalence-agreement effect can either positively or negatively influence the observed percent agreement. By moving away from the assumption of random assignment by raters, we derive a new coefficient of agreement that effectively removes the prevalence-agreement effect. We also discuss the significance of this new coefficient, its interpretation, and the stability of its estimation (standard error). (PsycInfo Database Record (c) 2025 APA, all rights reserved).
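For reference, the chance correction that the abstract describes can be sketched in a few lines. This is a minimal illustration of the standard Cohen's kappa, not the new coefficient proposed in the article; the function name `cohen_kappa` and the 2x2 table are hypothetical and chosen only for illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square contingency table.

    Rows are rater A's categories, columns are rater B's categories.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed proportion of agreement: the diagonal of the table.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected if the raters assigned categories independently.
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts, not taken from any study listed on this page.
hypothetical_table = [[40, 10], [5, 45]]
kappa = cohen_kappa(hypothetical_table)
print(round(kappa, 3))
```

In this made-up table the raters agree on 85% of subjects, the expected chance agreement is 50%, and kappa is therefore (0.85 - 0.50) / (1 - 0.50).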

Similar Papers
  • Research Article
  • 10.1177/00131644251380540
Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence.
  • Nov 3, 2025
  • Educational and psychological measurement
  • Rashid Saif Almehrizi

Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
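The multi-rater chance correction mentioned above can be sketched as follows. This is a minimal illustration of the standard Fleiss' kappa, assuming every subject is rated by the same number of raters; it is not the lambda coefficient introduced in the paper, and the function name `fleiss_kappa` and the counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who placed subject i in category j.
    Every subject must have the same total number of ratings.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Per-subject agreement: share of rater pairs that agree on that subject.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_subject) / n_subjects
    # Overall proportion of ratings falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 subjects, 3 raters, 2 categories.
hypothetical_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(round(fleiss_kappa(hypothetical_counts), 3))
```

Here the mean pairwise agreement is 2/3 and the expected agreement is 1/2, so the coefficient works out to 1/3.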

  • Research Article
  • 10.12738/estp.2016.4.0080
Investigation of Coefficient of Individual Agreement in Terms of Sample Size, Random and Monotone Missing Ratio, and Number of Repeated Measures
  • Aug 5, 2016
  • Educational Sciences: Theory & Practice
  • Gülhan Orekici Temel + 3 more

Science and technology are rapidly advancing, and it is a major priority of most countries to keep up with these developments. Advanced technologies and microelectronics have developed in the last quarter of the 20th century and have found applications in all areas; this is believed to constitute the third stage of the Industrial Revolution. This stage has allowed immense progress in technological development and has made it possible to move forward from an industrial society to a knowledge society. In a knowledge society, information technologies are at the center of production and the economy. Advanced information technologies allowing easy access to all parts of the world have facilitated knowledge attainment, trading knowledge and passing it from one to another. This stage, in which stupendous progress has been possible, has been termed the Knowledge-Information Age. Rapid spread of inventions such as computers and the Internet in this era has expedited scientific and technological advances and facilitated significant industrialization (Erdogan, 2011). Countries' development can now be measured with the level of generated and transferred knowledge, and societies that can train individuals with high self-esteem and who can research and question are at the top of this league table. Training individuals with these qualifications is only possible with well-planned, high quality and sustainable educational initiatives. Such an education requires continuous and long-term assessment of individuals and includes the identification and removal of deficiencies in education. This is feasible through studies based on longitudinal data. Studies based on longitudinal data in general focus on the change and development in the situation that is being investigated and allow the examination of issues such as education, individual development, cultural change and socioeconomic development in time (Rajulton, 2001). In other words, data related to longitudinal studies are obtained from the same variable at different time frames and are intended to measure the development of the variable in time. Assessment instruments used for this purpose may include paper and pencil and psychokinetic tests for basic competences as well as instruments that will assess affective competencies. Taking repeated measures or obtaining scores from multiple raters may be necessary to ensure reliability of the measurements obtained through these instruments. However, reliability is observed as one of the weakest links in essays, verbal and kinetic examinations that involve multiple raters for gathering longitudinal data. Therefore, it is crucial to ensure inter-rater or intra-rater harmony in these studies for the reliability of scoring (Cohen, 1990; Guler & Gelbal, 2010; Lin, Hedayet, & Wu, 2012). In other words, when inter-rater reliability, measurement tools, and longitudinal studies are considered together, error-free measurement of the change and development observed in time will be seen as the function of assessment tools and inter-rater reliability. In this context, Haber, Gao, and Barnhart (2007) proposed that disagreement between data obtained from the same individuals towards the same variable with different methods is similar to disagreement between repeated measures data obtained from same individuals towards the same variable with different methods, and thus developed the coefficient of individual agreement (CIA), a function of disagreement in which one of the raters acts as a reference to measure agreement or disagreement between methods. The agreement calculations between raters differ according to the measurement level of measuring device (nominal, ordinal, interval etc.) and the number of raters (Carletta, 1996; Cohen, 1960). The most basic agreement coefficient is Cohen's kappa coefficient, an agreement coefficient in a measurement tool measured at the classification level of two raters.
…

  • Research Article
  • Citations: 45
  • 10.1111/j.1365-277x.2006.00702.x
Parent and child reports of fruit and vegetable intakes and related family environmental factors show low levels of agreement
  • Aug 1, 2006
  • Journal of Human Nutrition and Dietetics
  • N I Tak + 3 more

The purpose of the present study was to determine the level of agreement between child and parent reports of 9- to 10-year-old children's consumption of fruit and vegetables and potential family-environmental determinants. Schoolchildren and their parents completed parallel questionnaires at baseline and at follow-up (1 year later) about usual fruit and vegetable consumption of the child, potential determinants and general demographics. Matched child-parent couples were included in the analyses (baseline = 380; follow-up = 307). To assess the level of agreement between child and parent reports at both points in time, dependent-sample t-test, correlation coefficients, weighted Cohen's kappa coefficients and Bland-Altman plots including limits of agreement were used. Both at baseline and at follow-up, the mean intakes of fruit and vegetables reported by the children were significantly higher than those reported by their parents, but differences were smaller at follow-up. Correlation coefficients between child and parent reports (0.28-0.43) and weighted Cohen's kappa coefficients (0.25-0.28) were weak to moderate. Limits of agreement were wide. The agreement between parent and child reports is weak to moderate and may depend on the age of the child. Fourth graders may overestimate their own intake of fruit and vegetables.

  • Research Article
  • Citations: 7
  • 10.4103/jcvjs.jcvjs_84_17
The efficacy of sagittal cervical spine subtyping: Investigating radiological classification methods within 150 asymptomatic participants
  • Jan 1, 2017
  • Journal of Craniovertebral Junction & Spine
  • Lee Daffin + 2 more

Aims: The aim of this study is to (1) compare and contrast cervical subtype classification methods within an asymptomatic population, and (2) identify inter-methodological consistencies and describe examples of inconsistencies that have the potential to affect subtype classification and clinical decision-making. Methods: A total of 150 asymptomatic 18–30-year-old participants met the strict inclusion criteria. An erect neutral lateral radiograph was obtained using standard procedures. The Centroid, modified Takeshima/Herbst methods and the relative rotation angles in cases of nonagreement were used to determine subtype classifications. Cohen's kappa coefficient (κ) was used to assess the level of agreement between the two methods. Results: Nonlordotic classifications represented 66% of the cohort. Subtype classification identified the cohort as lordosis (51), straight (37), global kyphosis (30), sigmoidal (13), and reverse sigmoidal (RS) (19). Cohen's kappa coefficient indicated that there was only a moderate level of agreement between methods (κ = 0.531). Methodological agreement tended to be higher within the lordotic and global kyphotic subtypes, whereas straight, sigmoidal, and RS subtypes demonstrated less agreement. Conclusion: This is the first study of its type to compare and contrast cervical classification methods. Subtypes displaying predominantly extended or flexed segments demonstrated higher levels of agreement. Our findings highlight the need for establishing a standardized multi-method approach to classify sagittal cervical subtypes.

  • Research Article
  • Citations: 30
  • 10.1002/uog.11132
Visibility and measurement of Cesarean section scars in pregnancy: a reproducibility study
  • Nov 1, 2012
  • Ultrasound in Obstetrics & Gynecology
  • O Naji + 10 more

To evaluate the visibility of cesarean section (CS) scars by transvaginal sonography (TVS) in pregnant women, to apply a standardized approach for measuring CS scars and to test its reproducibility throughout the course of pregnancy. In this observational cohort study, 320 consecutive pregnant women with a previous cesarean delivery were examined to assess scar visibility by two independent examiners. TVS was carried out at 11-13, 19-21 and 34-36 weeks' gestation. A scar was defined as visible when an area of hypoechogenicity representing myometrial discontinuity at the anterior wall of the lower uterine segment was identified. In a subset of patients (n = 111), visible scars were measured by two independent examiners in three dimensions: scar width, depth and length as well as the residual myometrial thickness (RMT). Descriptive analysis was used to assess scar visibility, and the intraclass correlation coefficient (ICC) was calculated to show the strength of absolute agreement between two examiners for scar measurements. For RMT, a cut-off of 2.4 mm was used and measurement agreement was assessed using Cohen's kappa coefficient. The scar was visible in 284/320 cases (88.8%). Visible scars were significantly associated with anteverted uteri (P < 0.0001). Both examiners had 100% agreement on scar visibility at 12 and 20 weeks' gestation, while agreement was 96% at 34 weeks. The intra- and interobserver agreements for scar measurements were generally good (ICC 0.86 and 0.89, respectively). The kappa coefficient for the RMT was 0.27 in the first trimester, compared with 0.51 and 0.72 in the second and third trimesters, respectively. CS scars remain visible in the majority of women throughout pregnancy. They can be reproducibly measured in three dimensions when assessed by TVS in all trimesters of pregnancy. 
The agreement between two observers for CS scar measurement can be considered good for the first trimester, compared with relatively moderate agreement for the second and third trimesters.

  • Research Article
  • 10.1590/s0100-72032014000100009
Cytological smears of women diagnosed with adenocarcinoma of the uterine cervix
  • Jan 1, 2014
  • Revista brasileira de ginecologia e obstetricia : revista da Federacao Brasileira das Sociedades de Ginecologia e Obstetricia
  • Maria Isabel Do Nascimento + 1 more

To analyze the cytological findings of women with cervical adenocarcinoma, taking into account the patient's history in the year prior to diagnosis and the histopathological aspects of the lesions. A retrospective comparative study was conducted using data from women with cervical adenocarcinoma or squamous carcinoma detected between 2002 and 2008. The cytological reports were synthesized according to the Bethesda System revised in 2001 and were compared to the histopathological findings of cervical adenocarcinoma and squamous carcinoma. The distributions of cytological findings were calculated, as well as the global agreement and chance-corrected agreement using the Cohen's Kappa Coefficient. For this purpose, the cytological findings were grouped according to the epithelial origin, forming the glandular cell and squamous cell groups, with the histopathologically confirmed tumor types (adenocarcinoma versus squamous carcinoma) being used as the gold standard. A total of 284 cases of cervical cancer were diagnosed during the study period. The effectively studied cases were 27 and 54 patients with adenocarcinoma and squamous carcinoma, respectively. The adenocarcinoma group represented 9.5% of the total cases diagnosed, and 56.0% of the women in this group were younger than 50 years. Cervical cytology was collected on average 92 days before the cancer diagnosis (range: 19 days to 310 days). In 41.6% of cases the cytological results were consistent with glandular alterations such as adenocarcinoma cells or atypical glandular cells. The global agreement and Cohen's Kappa Coefficient were 73.7 and 48.7%, suggesting substantial and moderate agreement, respectively. In this population, the cytological smears had an important role in screening women with adenocarcinoma, although some of them were referred to clarify the clinical symptoms. The agreement between cytological and histopathological findings was moderate.

  • Research Article
  • Citations: 8
  • 10.1515/cclm-2021-0655
Comparison of the QuikRead go® point-of-care faecal immunochemical test for haemoglobin with the FOB Gold Wide® laboratory analyser to diagnose colorectal cancer in symptomatic patients.
  • Oct 25, 2021
  • Clinical Chemistry and Laboratory Medicine (CCLM)
  • William Maclean + 7 more

Faecal immunochemical testing for haemoglobin (FIT) is used to triage patients for colonic investigations. Point-of-care (POC) FIT devices on the market have limited data for their diagnostic accuracy for colorectal cancer (CRC). Here, a POC FIT device is compared with a laboratory-based FIT system using patient-collected samples from the urgent referral pathway for suspected CRC. A prospective, observational cohort study. Patients collected two samples from the same stool. These were measured by POC QuikRead go® (Aidian Oy, Espoo, Finland) and laboratory-based FOB Gold Wide® (Sentinel Diagnostics, Italy). Faecal haemoglobin <10 μg haemoglobin/g of faeces was considered as negative. At this threshold, comparisons between the two systems were made by calculating percentage agreement and Cohen's kappa coefficient. Proportions of negative results were compared using chi-squared testing. Sensitivities for CRC were calculated. A total of 629 included patients provided paired samples for FIT to compare the QuikRead go® and FOB Gold Wide®. The agreement around the negative threshold was 83.0% and Cohen's kappa coefficient was 0.54. The QuikRead go® reported 440/629 (70.0% of samples) as negative compared to 523/629 (83.1%) for the FOB Gold Wide®; this difference was significant (p-value<0.001). Sensitivities for CRC detection by the QuikRead go® and FOB Gold Wide® were 92.9% (95% confidence interval (CI): 68.5-98.7%) and 100% (CI: 78.5-100%) respectively. Both systems were accurate in their ability to detect CRC. Whilst good agreement around the negative threshold was identified, more patients would be triaged to further colonic investigation if using the QuikRead go®.
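The reported kappa can be cross-checked from the figures given in this abstract, because for a two-category comparison the chance-expected agreement depends only on each system's marginal proportions. The sketch below uses only the reported values (629 paired samples, 440 and 523 negatives, 83.0% agreement); the function name `kappa_from_marginals` is hypothetical, and the result is an approximate consistency check rather than a reanalysis of the study data.

```python
def kappa_from_marginals(n, neg_a, neg_b, p_observed):
    """Cohen's kappa for a binary outcome from marginal counts.

    n          -- number of paired samples
    neg_a      -- negatives reported by system A
    neg_b      -- negatives reported by system B
    p_observed -- observed proportion of agreement
    """
    pos_a, pos_b = n - neg_a, n - neg_b
    # Chance-expected agreement under independent classification.
    p_expected = (neg_a * neg_b + pos_a * pos_b) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Values as reported in the abstract above.
kappa = kappa_from_marginals(n=629, neg_a=440, neg_b=523, p_observed=0.830)
print(round(kappa, 2))
```

With these inputs the expected agreement is roughly 0.63, and the resulting coefficient rounds to the 0.54 stated in the abstract.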

  • Research Article
  • Citations: 52
  • 10.1111/coa.12898
A cross-sectional evaluation of the validity of a smartphone otoscopy device in screening for ear disease in Nepal.
  • May 28, 2017
  • Clinical Otolaryngology
  • R Mandavia + 3 more

Hearing loss is a neglected international health problem. The greatest burden of ear disease is in low-income countries where there is also a lack of resources. In this context, screening for otological disease may be worthwhile. Cupris© has developed an otoscopy device that offers the possibility of low-cost mass screening in remote communities. We evaluated the validity of this device in diagnosing ear disease and in determining whether referral to an ENT centre is warranted. Cross-sectional study. Outpatient clinic, Nepal. All adults and children were invited to take part over a 2-day period. The Cupris© device was used to record participants' otological history and examination. Stored history and images were assessed in the United Kingdom by a Consultant-grade ENT Surgeon, who provided a diagnosis and decided whether referral to an ENT centre was warranted. After screening with the Cupris© device, participants were immediately assessed by a UK-trained ENT Consultant Surgeon using a standard otoscope ("standard assessment"). A diagnosis was recorded for each participant and a decision was made as to whether referral to an ENT centre was warranted. Concordance in primary diagnosis (analysed per ear) and concordance in the decision to refer (analysed per patient). Cohen's kappa coefficient for inter-rater agreement in diagnosis. Fifty-six patients agreed to participate. In four patients, the quality of video recorded precluded a diagnosis or management plan. These patients were excluded from subsequent analysis, leaving 52 patients for analysis. The same diagnosis was reached for 99 of 104 ears when comparing the Cupris© device to standard assessment (95% concordance), with Cohen's kappa coefficient of 0.89. The decision as to whether a patient should be referred to an ENT centre for further assessment was the same for all 52 participants when comparing the Cupris© device to standard assessment. 
When compared to standard assessment, the Cupris© device is a valid tool for the diagnosis of ear disease and decision for onward referral. It shows considerable promise for use by trained non-medical workers, as a low-cost and portable tool to screen for ear disease in remote settings, particularly in low- and middle-income countries.

  • Research Article
  • 10.1302/1358-992x.2025.12.069
AUTOMATED MULTIPLEX PCR FOR IDENTIFYING THE CAUSING PATHOGEN IN SEPTIC ARTHRITIS
  • Nov 4, 2025
  • Orthopaedic Proceedings
  • Lukas Rabitsch + 6 more

Aim Septic arthritis (SA) in adults can lead to serious complications if not diagnosed promptly. Conventional synovial fluid culture (CC) remains the gold standard for identifying the causative microorganism but is time-consuming and often insensitive, particularly in patients receiving antimicrobial therapy. This study aimed to evaluate the diagnostic performance of a novel automated multiplex PCR (mPCR) system and compare it with conventional culture (CC) in adults with suspected acute native joint infections. Method In this retrospective single-centre study, adult patients with suspected SA (February 2023-May 2024) were included. Diagnosis was based on institutional criteria incorporating clinical signs, synovial fluid cytology, microbiology, and histology. Agreement between mPCR and CC was assessed using overall percentage agreement and Cohen's Kappa coefficient. Diagnostic performance metrics were calculated for mPCR, CC, and their combined use. Results Of 143 included patients, 96 (67%) were diagnosed with SA. When considering mPCR-specific microorganisms, mPCR identified 13 additional microorganisms compared to CC. Nine of these (9/13) were diagnosed with SA and six of these (6/9, 67%) were on antibiotics prior to aspiration. Overall agreement between mPCR and CC was 91%, with a positive agreement of 100%, negative agreement of 88% and a Cohen's Kappa coefficient of 0.780. Considering all microorganisms (including off-panel organisms), the overall agreement was 89%, the positive agreement 92%, the negative agreement 88%, and the Cohen's Kappa 0.735. The mPCR demonstrated a sensitivity of 45% and specificity of 89%, while conventional culture showed a sensitivity of 40% and specificity of 100%. No significant difference in performance was observed between the two methods (p = 0.183). Moreover, the combined use (mPCR + CC) yielded a sensitivity of 48% and specificity of 89% (AUC = 0.686). 
Conclusions The novel automated mPCR system demonstrated a diagnostic performance similar to that of conventional synovial fluid culture, offering the added benefit of a quicker turnaround time, which can be crucial for patient care. Its use is especially evident in patients who have received prior antibiotic treatment, where conventional cultures may be less reliable.

  • Research Article
  • Citations: 30
  • 10.1111/j.1467-9574.2008.00412.x
Agreement between an isolated rater and a group of raters
  • Jan 16, 2009
  • Statistica Neerlandica
  • S Vanbelle + 1 more

The agreement between two raters judging items on a categorical scale is traditionally assessed by Cohen's kappa coefficient. We introduce a new coefficient for quantifying the degree of agreement between an isolated rater and a group of raters on a nominal or ordinal scale. The group of raters is regarded as a whole, a reference or gold‐standard group with its own heterogeneity. The coefficient, defined on a population‐based model, requires a specific definition of the concept of perfect agreement. It has the same properties as Cohen's kappa coefficient and reduces to the latter when there is only one rater in the group. The new approach overcomes the problem of consensus within the group of raters and generalizes Schouten's index. The method is illustrated on published syphilis data and on data collected from a study assessing the ability of medical students in diagnostic reasoning when compared with expert knowledge.

  • Research Article
  • Citations: 2
  • 10.3389/fonc.2025.1467664
High Ki67 expression, HER2 overexpression, and low progesterone receptor levels in high-grade DCIS: significant associations with clinical practice implications.
  • Jan 28, 2025
  • Frontiers in oncology
  • Hossein Schandiz + 6 more

We investigated the role of Ki67, a ubiquitous marker in cancer, within the context of ductal carcinoma in situ (DCIS), a precursor of invasive breast cancer. Through rigorous analysis of histopathological and immunopathological samples from a substantial cohort, this study revealed robust correlations between heightened Ki67 expression, diminished progesterone (PR) levels, and HER2 overexpression, indicative of aggressive DCIS phenotypes. These findings offer novel insights into the surrogate immunomolecular subtyping landscape of DCIS, potentially refining risk stratification and therapeutic approaches. This elucidation underscores the translational significance of Ki67 as a prognostic and predictive biomarker in DCIS, with implications for personalized treatment paradigms and patient outcomes. The Ki67 proliferation index is widely used in various tumors, including invasive breast carcinoma (IBC). However, its prognostic utility is often constrained by technical complexity. Its diagnostic and clinical significance in ductal carcinoma in situ (DCIS) remains uncertain. We studied Ki67 immunohistochemistry interobserver diagnostic agreement at different cutoff values in high-grade DCIS. Additionally, we investigated the associations between Ki67 expression, PR levels, and human epidermal growth factor receptor 2 (HER2) in high-grade DCIS among various subtypes (Luminal (Lum) A, LumB HER2-, LumB HER2+, HER2-enriched, and triple-negative). Using histopathological specimens from 484 patients diagnosed with DCIS between 1996 and 2018, we implemented the 2013 St. Gallen recommendations for surrogate immunomolecular subtyping of IBC. Subtypes were classified, and the Ki67 interobserver diagnostic agreement between Counting Pathologist 1 (CP1) and CP2 was calculated using Cohen's kappa coefficient at various cutoff values. The Cohen's kappa coefficient for interobserver agreement between CP1 and CP2 was κ = 0.586, indicating moderate agreement. 
Ki67 levels varied significantly among subtypes (p < 0.0001), with a median Ki67% being higher in cases with invasive components (p = 0.0351). Low PR combined with high Ki67% was significantly associated with HER2 overexpression (p = 0.0107). Interobserver agreement for the Ki67 count was moderate. Ki67 expression showed considerable variability in high-grade DCIS. Low PR levels combined with high Ki67 expression were linked to HER2 overexpression, showing possible clinical implications for identifying high-risk DCIS.

  • Research Article
  • Citations: 33
  • 10.2169/internalmedicine.51.6718
Validity and Reliability Assessment of a Japanese Version of the Snaith-Hamilton Pleasure Scale
  • Jan 1, 2012
  • Internal Medicine
  • Hiroshi Nagayama + 18 more

Anhedonia is one of the main non-motor symptoms in Parkinson's disease (PD); it is assessed using the Snaith-Hamilton pleasure scale (SHAPS). To assess anhedonia in the Japanese population, we prepared a Japanese language version of SHAPS (SHAPS-J), and evaluated its validity and reliability in 8 neurological centers. Seventy subjects (48 patients with PD and 22 healthy subjects) were enrolled in this study. The validity of the test was assessed by the correlation between SHAPS-J and the apathy scale, based on the fact that anhedonia is considered a symptom of apathy syndrome. Test-retest reliability and internal consistency were assessed by Cohen's kappa and Cronbach's alpha coefficients, respectively. In the evaluation of validity, the total scores obtained on SHAPS-J during the test and retest significantly correlated with scores on Item 4 in Part 1 of the unified Parkinson's disease rating scale (p<0.0008 and p<0.0036, respectively). Cohen's kappa coefficient was >0.3 on all items (p<0.0005 on all items). Cronbach's alpha coefficient was 0.90 at the baseline and 0.88 at the retest. These results indicate that SHAPS-J has good validity, test-retest reliability, and internal consistency, thus establishing an available measure of anhedonia in Japanese.

  • Research Article
  • Citations: 38
  • 10.1016/j.jadohealth.2018.08.015
Parent and Adolescent Attitudes Towards Preventive Care and Confidentiality
  • Nov 3, 2018
  • Journal of Adolescent Health
  • Xiaoyu Song + 8 more


  • Research Article
  • Citations: 1
  • 10.1016/j.arcped.2005.04.004
Adoption internationale : vision de deux pédiatres québécoises [International adoption: the perspective of two Quebec pediatricians]
  • Apr 30, 2005
  • Archives de pédiatrie
  • L Auger + 1 more


  • Research Article
  • 10.3168/jds.2024-25940
Evaluation of a fully automated 2-dimensional imaging system for real-time cattle lameness detection using machine learning.
  • Apr 1, 2025
  • Journal of dairy science
  • N Siachos + 7 more

Early detection and prompt treatment of lame cows are crucial for proactive lameness management. This study aimed to evaluate a fully automated 2-dimensional imaging system for real-time lameness detection using artificial intelligence. Data were collected from 11 dairy farms in the UK. Four trained veterinarians performed 42 mobility scoring sessions using a 4-grade scoring system (0-3), with scores 2 and 3 representing lameness. On each session, individual weekly average scores were calculated. This resulted in 40,116 paired human mobility scores (HMS) and weekly average mobility scores generated using artificial intelligence (AIMS) matched to a cow ID. Categorical agreement for the 4-grade scale was estimated by calculating the weighted Cohen's kappa (κw) and Gwet's agreement coefficient (AC2), and for the 2-grade scale (nonlame vs. lame) by calculating the percentage agreement (PA), unweighted Cohen's kappa (κ) and Gwet's coefficient (AC1). A trained veterinarian recorded the presence and severity of any lesion of 2,515 cows, which also had an AIMS assigned. A subset of 758 cows were also assigned an HMS 1-3 d before trimming. Sensitivity (Se), specificity (Sp), and accuracy (Acc) were calculated to describe the system's and human's ability to detect cows with foot lesions. Additionally, automated mobility scores were retrieved for cows with foot lesion records up to 30 d before trimming. Linear mixed effects models (LMM) were built to assess the association of the lesion status at trimming with the daily scores. The average (mAVG), maximum (mMAX), minimum (mMIN) and the percentage of scores that a cow was identified as lame (mPLS) during the 30 d before foot trimming were calculated and their Se, Sp and Acc in detecting foot lesions were determined. Lastly, longitudinal data were obtained from 143 cows tracking daily scores from 5 to 64 DIM. 
The association of lesion status at the early lactation routine trim (ELRT) with the daily scores was assessed by fitting LMM. Regarding the 4-grade scale agreement between HMS and AIMS, κw (0.24-0.34) represented fair agreement, whereas AC2 (0.81-0.93) almost perfect agreement. For the 2-grade scale agreement, PA was consistently above 80%, κ (0.23-0.38) represented fair agreement, and AC1 (0.76-0.83) showed substantial to almost perfect agreement. The AIMS detected cows bearing severe lesions with Se = 0.53 and Sp = 0.74, whereas the HMS achieved Se = 0.60 and Sp = 0.78. Using optimal thresholds for mAVG, mMAX, mMIN, and mPLS, the system achieved higher Se than HMS. Moreover, cows with severe lesions had increased scores from 23 d before trimming compared with cows with mild and moderate lesions. Longitudinal data showed that cows with severe lesions at ELRT had higher mobility scores during the first 60 DIM compared with those with mild or moderate lesions. Overall, the system's performance was comparable to that of experienced human assessors in detecting lame cows and cows with foot lesions. Finally, its capability to detect mobility changes before the development of severe lesions highlights its potential for early intervention, which could enhance lameness management in dairy herds.
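The gap between κ and AC1 at a similar percentage agreement, as reported above, is what one would expect when one category dominates. The sketch below illustrates this with a made-up 2x2 table, using the standard two-rater definitions of Cohen's kappa and Gwet's AC1; the counts and function names are hypothetical and are not the study's data.

```python
def agreement_coefficients(table):
    """Return (percent agreement, Cohen's kappa, Gwet's AC1) for a square table."""
    n = sum(sum(row) for row in table)
    q = len(table)
    p_o = sum(table[i][i] for i in range(q)) / n
    row_marg = [sum(table[i]) / n for i in range(q)]
    col_marg = [sum(table[i][j] for i in range(q)) / n for j in range(q)]
    # Cohen: chance agreement from the product of the two raters' marginals.
    pe_kappa = sum(r * c for r, c in zip(row_marg, col_marg))
    # Gwet: chance agreement from the average marginal of each category.
    avg_marg = [(r + c) / 2 for r, c in zip(row_marg, col_marg)]
    pe_ac1 = sum(p * (1 - p) for p in avg_marg) / (q - 1)
    kappa = (p_o - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return p_o, kappa, ac1


# Hypothetical imbalanced table: most animals scored "nonlame" by both raters.
hypothetical_table = [[80, 8], [7, 5]]
pa, kappa, ac1 = agreement_coefficients(hypothetical_table)
print(round(pa, 2), round(kappa, 2), round(ac1, 2))
```

In this invented example the two raters agree on 85% of cases, yet kappa is about 0.31 while AC1 is about 0.81, because the two coefficients estimate chance agreement differently under skewed prevalence.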
