From Agreement to Epistemic Alignment: A Signal Detection-Theoretic Model of Inter-Rater Reliability.
Inter-rater reliability is commonly assessed using chance-corrected agreement coefficients such as Cohen's κ, which summarize concordance among categorical judgments without modeling the inferential processes that generate them. As a result, κ is sensitive to prevalence imbalance, task difficulty, and heterogeneity in decision criteria and is often misinterpreted as a proxy for diagnostic accuracy or rater competence. This paper reframes inter-rater reliability within a signal detection-theoretic (SDT) framework in which categorical judgments arise from comparisons between latent continuous evidence and rater-specific decision thresholds. Within this generative model, κ can be interpreted as a bounded transformation of discrete strategic variance (i.e., the observable consequence of dispersion in latent decision criteria) rather than as a direct measure of epistemic alignment. To make this structure explicit, we introduce the Strategic Convergence Index (SCI), a normalized functional summarizing convergence in rater decision thresholds under an SDT generative process. SCI is not proposed as a standalone agreement coefficient but as a model-implied quantity whose interpretation depends on explicit assumptions about evidence distributions and decision rules. Monte Carlo simulations show that κ varies systematically with prevalence and perceptual discriminability even when decision-policy alignment is held constant, whereas SCI selectively tracks epistemic alignment and remains invariant to these factors. Supplementary model-based analyses further illustrate that SCI can be recovered as a stable system-level property even under latent-truth uncertainty, whereas individual thresholds may be weakly identified. Together, these results clarify the epistemic meaning of κ and motivate a decomposition of inter-rater reliability into outcome-level agreement and process-level alignment. 
By linking classical agreement statistics to an explicit generative model of judgment, the Strategic Convergence framework advances reliability assessment from description toward explanation.
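The claim that κ shifts with prevalence and discriminability while decision policy stays fixed can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation. It assumes an equal-variance Gaussian SDT model in which each rater sees independently noisy evidence about the same item. The function names (`cohen_kappa`, `simulate_kappa`) and all parameter values are hypothetical. SCI is not computed, because the abstract does not give its formula.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) ratings."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement from the product of each rater's marginal proportions.
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_obs - p_exp) / (1 - p_exp)


def simulate_kappa(prevalence, d_prime, c1, c2, n_items=20000, seed=0):
    """Simulate two SDT raters with thresholds c1 and c2; return kappa.

    Assumed model: noise items yield evidence ~ N(0, 1), signal items
    ~ N(d_prime, 1), and each rater draws evidence independently.
    """
    rng = random.Random(seed)
    r1, r2 = [], []
    for _ in range(n_items):
        mu = d_prime if rng.random() < prevalence else 0.0
        r1.append(int(rng.gauss(mu, 1.0) > c1))
        r2.append(int(rng.gauss(mu, 1.0) > c2))
    return cohen_kappa(r1, r2)


if __name__ == "__main__":
    # Identical thresholds throughout: decision-policy alignment is constant.
    for prev, dp in [(0.5, 2.0), (0.05, 2.0), (0.5, 0.5)]:
        k = simulate_kappa(prev, dp, c1=1.0, c2=1.0)
        print(f"prevalence={prev:.2f} d'={dp:.1f} kappa={k:.3f}")
```

Under these assumptions both raters apply the same threshold in every condition, so any change in κ across the three runs comes from prevalence or discriminability rather than from a difference in decision criteria.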
- Research Article
57
- 10.3168/jds.2014-8129
- Jul 2, 2014
- Journal of Dairy Science
Effect of merging levels of locomotion scores for dairy cows on intra- and interrater reliability and agreement
- Research Article
29
- 10.3168/jds.2021-20503
- Aug 26, 2021
- Journal of Dairy Science
Evaluation of inter-rater agreement of the clinical signs used to diagnose bovine respiratory disease in individually housed veal calves
- Research Article
25
- 10.1111/jvim.15987
- Dec 7, 2020
- Journal of Veterinary Internal Medicine
Background: Grading of equine gastric ulcer syndrome (EGUS) is undertaken in clinical and research settings, but the reliability of EGUS grading systems is poorly understood. Hypothesis/Objectives: Investigate interobserver and intraobserver reliability of an established ordinal grading system and a novel visual analog scale (VAS), and assess the influence of observer experience. Animals: Sixty deidentified gastroscopy videos. Methods: Six observers (3 specialists and 3 residents) graded videos using the EGUS Council (EGUC) system and VAS. Observers graded the videos three times for each system, using a cross‐over design with at least 1 week between each phase. The order of videos was randomized for each phase. Interobserver and intraobserver reliability were estimated using Gwet's agreement coefficient with ordinal weights applied (AC2) for the EGUC system and the intraclass correlation coefficient (ICC) for the VAS. Results: Using the EGUC system, interobserver reliability was substantial for squamous (AC2 = 0.69; 95% confidence interval [CI], 0.57‐0.80) and glandular mucosa (AC2 = 0.72; 95% CI, 0.70‐0.75), and intraobserver reliability was substantial for squamous (AC2 = 0.80; 95% CI, 0.71‐0.90) and glandular mucosa (AC2 = 0.80; 95% CI, 0.74‐0.86). Interobserver reliability using the VAS was moderate for squamous (ICC = 0.64; 95% CI, 0.31‐0.96) and poor for glandular mucosa (ICC = 0.35; 95% CI, 0.06‐0.64), and intraobserver reliability was moderate for squamous (ICC = 0.74; 95% CI, 0.62‐0.86) and glandular mucosa (ICC = 0.56; 95% CI, 0.39‐0.72). Conclusions and Clinical Importance: The EGUC system had acceptable intraobserver and interobserver reliability and performed well regardless of observer experience. Familiarity and observer experience improved reliability of the VAS.
- Research Article
32
- 10.1017/s1041610214000052
- Feb 10, 2014
- International psychogeriatrics
Quality of life (Qol) is an increasingly used outcome measure in dementia research. The QUALIDEM is a dementia-specific and proxy-rated Qol instrument. We aimed to determine the inter-rater and intra-rater reliability in residents with dementia in German nursing homes. The QUALIDEM consists of nine subscales that were applied to a sample of 108 people with mild to severe dementia and six consecutive subscales that were applied to a sample of 53 people with very severe dementia. The proxy raters were 49 registered nurses and nursing assistants. Inter-rater and intra-rater reliability scores were calculated on the subscale and item level. None of the QUALIDEM subscales showed strong inter-rater reliability based on the single-measure Intra-Class Correlation Coefficient (ICC) for absolute agreement ≥ 0.70. Based on the average-measure ICC for four raters, eight subscales for people with mild to severe dementia (care relationship, positive affect, negative affect, restless tense behavior, social relations, social isolation, feeling at home and having something to do) and five subscales for very severe dementia (care relationship, negative affect, restless tense behavior, social relations and social isolation) yielded a strong inter-rater agreement (ICC: 0.72-0.86). All of the QUALIDEM subscales, regardless of dementia severity, showed strong intra-rater agreement. The ICC values ranged between 0.70 and 0.79 for people with mild to severe dementia and between 0.75 and 0.87 for people with very severe dementia. This study demonstrated insufficient inter-rater reliability and sufficient intra-rater reliability for all subscales of both versions of the German QUALIDEM. The degree of inter-rater reliability can be improved by collaborative Qol rating by more than one nurse. The development of a measurement manual with accurate item definitions and a standardized education program for proxy raters is recommended.
- Research Article
- 10.1093/bjs/znaf042.057
- Mar 12, 2025
- British Journal of Surgery
Introduction: Surgeons' visual assessments represent the most common reason for non-utilisation of kidneys and livers retrieved for transplantation. While variability in organ acceptance between transplant centres is recognised, the influence of differences in subjective visual evaluations remains unclear. This study investigates inter-rater agreement amongst consultant transplant surgeons visually assessing organs for transplantation. Methods: The study included 433 photographs (293 liver, 140 kidney), assessed by at least two transplant surgeons with over five years’ experience. Participants included eight surgeons from five centres for kidney photographs, and ten surgeons from six centres for liver photographs. Inter- and intra-rater agreements were determined with Gwet’s agreement coefficient (AC). Results: There was excellent inter-rater agreement for liver steatosis across all levels (none/mild/moderate/severe), AC 0.81 (95% CI: 0.76-0.86). Overall quality (good/moderate/poor) agreement was significantly lower in “good” vs “poor” quality livers, AC 0.86 (95% CI: 0.75-0.97) vs AC 0.57 (95% CI: 0.43-0.71), p<0.05. For kidney assessments, inter-rater agreement on global perfusion was excellent across all levels (good/fair/poor/patchy), AC 0.83 (95% CI: 0.73-0.93). Agreement on overall quality was mixed, ranging from AC 0.94 (95% CI: 0.73-1.00) for “good” kidneys, to AC 0.70 (95% CI: 0.56-0.84) for “moderate” kidneys. Intra-rater agreement for both organs was consistently high (AC>0.86). Conclusions: Despite excellent agreement among consultants on specific visual characteristics, there was a notable decrease in agreement when assessing overall quality. Clinicians concur on objective findings, but differ in how they weigh these observations when determining an organ's suitability for transplantation. These subjective differences may contribute to the variable organ acceptance rates across centres.
- Research Article
2
- 10.1093/pm/pnaa032
- Mar 13, 2020
- Pain medicine (Malden, Mass.)
Digital subtraction imaging (DSI) decreases the risk of intravascular injection during cervical transforaminal epidural steroid injection (CTFESI); however, sequence acquisition and interpretation are operator-dependent skills. This study tests the reliability of a grading system to determine adequate DSI during CTFESI. Academic tertiary medical center. A grading scheme for adequate DSI quality during CTFESI was created by the study authors based on patient positioning, mask image, and volume of contrast injected. The inter-rater and intrarater reliability values of this grading scheme were tested using 50 DSI images evaluated by three raters during two distinct sessions separated by four weeks. Based on a power analysis, a sample of 50 scans was sufficient to detect significant correlations. Inter-rater reliability was determined by percent agreement between graders for dichotomized categories of "quality of DSI is adequate for safe C-TFESI" vs "quality of DSI is inadequate for safe C-TFESI." The percentage of agreement was reported, along with Gwet's agreement coefficient (AC). The intrarater (pre/post) correlation was assessed using Yule's Q statistics. Correlation coefficients were interpreted as follows: 0.00-0.19 "very weak," 0.20-0.39 "weak," 0.40-0.59 "moderate," 0.60-0.79 "strong," and 0.80-1.00 "very strong." Inter-rater reliability analyses demonstrated that the patient position category had "very strong" agreement, contrast volume had "strong" agreement, and mask image had "moderate" agreement. The overall inter-rater reliability was "moderate." All of the raters demonstrated "very strong" intrarater reliability. The proposed grading system for adequate-quality DSI during CTFESI showed overall "moderate" and "very strong" inter- and intrarater reliability, respectively. This scheme provides an objective measure of DSI quality for CTFESI. Refinement is needed to improve the reliability of this scheme.
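The percent agreement and Yule's Q statistics named in this abstract can both be computed from a 2×2 table of dichotomized ratings. The sketch below is illustrative only. The cell counts are hypothetical, not data from the study, and the function names are assumptions. The label cut-offs follow the interpretation bands quoted in the abstract, with values between bands (for example 0.195) assigned to the lower band.

```python
def percent_agreement(a, b, c, d):
    """Percent of cases on the diagonal of a 2x2 table.

    a = adequate/adequate, d = inadequate/inadequate,
    b and c = the two kinds of disagreement.
    """
    return 100 * (a + d) / (a + b + c + d)


def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc) for a 2x2 table."""
    denom = a * d + b * c
    if denom == 0:
        raise ValueError("Yule's Q is undefined when ad + bc == 0")
    return (a * d - b * c) / denom


def strength_label(value):
    """Map a coefficient to the verbal bands quoted in the abstract."""
    v = abs(value)
    if v < 0.20:
        return "very weak"
    if v < 0.40:
        return "weak"
    if v < 0.60:
        return "moderate"
    if v < 0.80:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    # Hypothetical session-1 vs session-2 counts for one rater, 50 images.
    table = (30, 3, 2, 15)
    q = yules_q(*table)
    print(percent_agreement(*table), round(q, 3), strength_label(q))
```

Yule's Q depends only on the odds ratio of the table, so it can be high even when one category dominates. That is one reason to report it alongside raw percent agreement, as this study does.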
- Research Article
49
- 10.1093/jhps/hnaa064
- Mar 6, 2021
- Journal of Hip Preservation Surgery
To determine interobserver and intraobserver reliabilities of the combination of classification systems, including the Beck and acetabular labral articular disruption (ALAD) systems for transition zone cartilage, the Outerbridge system for acetabular and femoral head cartilage, and the Beck system for labral tears. Additionally, we sought to determine interobserver and intraobserver agreements in the location of injury to labrum and cartilage. Three fellowship trained surgeons reviewed 30 standardized videos of the central compartment with one surgeon re-evaluating the videos. Labral pathology, transition zone cartilage and acetabular cartilage were classified using the Beck, Beck and ALAD systems, and Outerbridge system, respectively. The location of labral tears and transition zone cartilage injury was assessed using a clock face system, and acetabular cartilage injury using a five-zone system. Intra- and interobserver reliabilities are reported as Gwet’s agreement coefficients. Interobserver and intraobserver agreement on the location of acetabular cartilage lesions was highest in superior and anterior zones (0.814–0.914). Outerbridge interobserver and intraobserver agreement was >0.90 in most zones of the acetabular cartilage. Interobserver and intraobserver agreement on location of transition zone lesions was 0.844–0.944. The Beck and ALAD classifications showed similar interobserver and intraobserver agreement for transition zone cartilage injury. The Beck classification of labral tears was 0.745 and 0.562 for interobserver and intraobserver agreements, respectively. The Outerbridge classification had almost perfect interobserver and intraobserver agreement in classifying chondral injury of the true acetabular cartilage and femoral head. The Beck and ALAD classifications both showed moderate to substantial interobserver and intraobserver reliabilities for transition zone cartilage injury. 
The Beck system for classification of labral tears showed substantial agreement among observers and moderate intraobserver agreement. Interobserver agreement on location of labral tears was highest in the region where most tears occur and became lower at the anterior and posterior extents of this region. The available classification systems can be used for documentation regarding intra-articular pathology. However, continued development of a concise and highly reproducible classification system would improve communication.
- Research Article
31
- 10.1378/chest.105.3.710
- Mar 1, 1994
- Chest
Assessment of interrater and intrarater reliability in the evaluation of metered dose inhaler technique.
- Research Article
1
- 10.1097/bpo.0000000000002495
- Sep 11, 2023
- Journal of pediatric orthopedics
Radiographic measurements of limb alignment in skeletally immature patients with anterior cruciate ligament (ACL) tears are frequently used for surgical decision-making, preoperative planning, and postoperative monitoring of skeletal growth. However, the interrater and intrarater reliability of these radiographic characteristics in this patient population is not well documented. Excellent reliability across 4 raters will be demonstrated for all digital measures of length, coronal plane joint orientation angles, mechanical axis, and tibial slope in skeletally immature patients with ACL tears. Cohort study (diagnosis). Three fellowship-trained orthopaedic surgeons and 1 medical student performed 2 rounds of radiographic measurements on digital imaging (lateral knee radiographs and long-leg radiographs) of skeletally immature patients with ACL tears. Intrarater and interrater reliability for continuous radiographic measurements was assessed with intraclass correlation coefficients (ICCs) across 4 raters with 95% CIs for affected and unaffected side measurements. Interrater reliability analysis used an ICC (2, 4) structure and intrarater reliability analysis used an ICC (2, 1) structure. A weighted kappa coefficient was calculated for ordinal variables along with 95% CIs for both interrater and intrarater reliability. Agreement statistic interpretations are based on scales described by Fleiss, and Cicchetti and Sparrow: <0.40, poor; 0.40 to 0.59, fair; 0.60 to 0.74, good; and >0.74, excellent. Radiographs from a convenience sample of 43 patients were included. Intrarater reliability was excellent for nearly all measurements and raters. Interrater reliability was also excellent for nearly all reads for all measurements. 
Radiographic reliability of long-leg radiographs and lateral knee x-rays in skeletally immature children with ACL tears is excellent across nearly all measures and raters and can be obtained and interpreted as reliable and reproducible means to measure limb length and alignment. Level III.
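The ICC(2, 1) and ICC(2, 4) structures mentioned in this abstract are the single-measure and average-measure forms of the two-way random-effects, absolute-agreement ICC. A minimal sketch of both, computed from two-way ANOVA mean squares, is given below. The ratings matrix is a small illustrative table, not data from the study, and the function name `icc2` is an assumption.

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for an n-subjects x k-raters matrix.

    Two-way random-effects model, absolute agreement, computed from the
    between-subjects (MSR), between-raters (MSC) and residual (MSE)
    mean squares.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


if __name__ == "__main__":
    # Illustrative scores: 6 subjects (rows) rated by 4 raters (columns).
    scores = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    single, average = icc2(scores)
    print(f"ICC(2,1)={single:.2f}  ICC(2,4)={average:.2f}")
```

The average-measure coefficient is always at least as large as the single-measure one when the latter is positive, so the two should not be compared against the same interpretive cut-offs without noting which form was used.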
- Research Article
7
- 10.1016/j.math.2013.10.002
- Oct 24, 2013
- Manual Therapy
Anatomical landmark position – Can we trust what we see? Results from an online reliability and validity study of osteopaths
- Research Article
38
- 10.3168/jds.2014-9059
- Sep 19, 2015
- Journal of Dairy Science
Relation between observed locomotion traits and locomotion score in dairy cows
- Research Article
1
- 10.2340/jrm.v55.2409
- Mar 9, 2023
- Journal of Rehabilitation Medicine
Objective: When linking outcomes to the International Classification of Functioning, Disability and Health (ICF), inter-rater reliability is typically assessed at the conclusion of the linking process. This method does not allow for iterative evaluation and adaptations that would improve inter-rater reliability as novices gain experience. This pilot study aims to quantify the inter-rater reliability of novice linkers when using an innovative, sequential, iterative linking method to link prosthetic outcomes to the ICF. Methods: Across 5 sequential rounds, 2 novices independently linked outcomes to the ICF. A consensus discussion followed each round that informed refinement of the customized ICF linking rules. The inter-rater reliability was calculated for each round using Gwet’s agreement coefficient (AC1). Results: A total of 1,297 outcomes were linked across 5 rounds. At the end of round 1 inter-rater reliability was high (AC1 = 0.74, 95% confidence interval (95% CI) 0.68–0.80). At the end of round 3, inter-rater reliability (AC1 = 0.84, 95% CI 0.80–0.88) was significantly improved and marked the point of consistency where further improvements in inter-rater reliability were not statistically significant. Conclusion: A sequential iterative linking method provides a learning curve that allows novices to achieve high levels of agreement through consensus discussion and iterative refinement of the customized ICF linking rules. Lay abstract: Outcomes are commonly used in healthcare and research to evaluate the effect of an intervention or treatment, such as the effect a prosthesis has on the ability to walk in the community or participate in activities. Cataloguing outcomes using well-established classification systems, such as the International Classification of Functioning, Disability and Health, is important, as it allows outcomes and research to be described using an internationally understood and agreed language.
This study aimed to describe an innovative approach to cataloguing outcomes to the ICF, using a method that provides novices with a learning opportunity. In using this innovative method novices were able to catalogue outcomes to the ICF framework with a similar degree of reliability as experts. This will reduce the barriers to novices conducting this type of research in the future.
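Gwet's AC1, used in this study for two raters and nominal categories, differs from Cohen's κ only in how chance agreement is estimated. The sketch below computes both from the same pair of rating lists. The data are hypothetical, chosen to show a skewed category distribution, and the function name `agreement_coefficients` is an assumption. Variance and confidence intervals are not covered.

```python
from collections import Counter


def agreement_coefficients(r1, r2, categories=None):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters.

    `categories` should list every category of the scale; if omitted,
    the categories observed in the data are used.
    """
    n = len(r1)
    cats = list(categories) if categories else sorted(set(r1) | set(r2))
    q = len(cats)
    if q < 2:
        raise ValueError("at least two categories are required")

    p_obs = sum(x == y for x, y in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)

    # Cohen: chance agreement is the product of the raters' marginals.
    pe_kappa = sum((c1[c] / n) * (c2[c] / n) for c in cats)
    # Gwet: chance agreement from the mean marginal pi_c of each category.
    pi = [(c1[c] + c2[c]) / (2 * n) for c in cats]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)

    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return p_obs, kappa, ac1


if __name__ == "__main__":
    # Hypothetical data: 100 items, 92 agreements, one dominant category.
    rater1 = ["yes"] * 90 + ["no"] * 2 + ["yes"] * 4 + ["no"] * 4
    rater2 = ["yes"] * 90 + ["no"] * 2 + ["no"] * 4 + ["yes"] * 4
    p_obs, kappa, ac1 = agreement_coefficients(rater1, rater2)
    print(f"agreement={p_obs:.2f} kappa={kappa:.2f} AC1={ac1:.2f}")
```

With one category this dominant, κ's chance-agreement term is large and the coefficient is pulled down, whereas AC1 stays close to the raw agreement. This behaviour is a common reason for choosing AC1 when categories are unevenly used.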
- Research Article
4
- 10.6114/jkood.2013.26.2.010
- May 25, 2013
- The Journal of Korean Oriental Medical Ophthalmology and Otolaryngology and Dermatology
Objectives: We performed a pilot study to investigate the inter- and intra-rater reliability of pattern identification using nasal endoscopy for allergic rhinitis (AR). Methods: Eight experts in ophthalmology, otolaryngology and dermatology of Korean medicine evaluated nasal endoscopy photographs from 20 AR patients using a pattern identification index for AR comprising nasal membrane color (pale / hyperemia), nasal membrane humidity (dryness / dampness), rhinorrhea (watery / yellow), and membrane edema (atrophic / edematous). Results: Intra-rater percent agreement and kappa coefficients generally ranged from 'moderate' to 'good' (% agreement: 73.13-90% / kappa: 0.547-0.748). Inter-rater percent agreement and kappa coefficients also ranged from 'moderate' to 'good' (% agreement: 65-85% / kappa: 0.475-0.778), except for the 'humidity (dryness / dampness)' item (% agreement: 55.98% / kappa: 0.340). In a subgroup analysis by rater affiliation, inter-rater percent agreement and kappa coefficients were higher among raters with the same affiliation than among raters with different affiliations, except for the 'dryness / dampness' item. Conclusions: The objectivity and reproducibility of pattern identification using nasal endoscopy for AR need to be improved through the development of detailed criteria, enhanced training of clinicians, and standard operating procedures (SOPs).
- Research Article
16
- 10.1111/jocn.12025
- Nov 5, 2012
- Journal of Clinical Nursing
To determine (1) the degree of interrater agreement and reliability of Glamorgan scale item and sum scores and (2) whether Glamorgan scale sum scores are valid. Pressure ulcer risk assessment scales are recommended for use in clinical practice. For paediatric patients, 12 instruments are currently described. Empirical evidence about the performance of Glamorgan scale scores in clinical practice is limited. An observational validation study was conducted on a paediatric cardiac unit of a large university hospital in Germany in April and May 2010. Children were assessed simultaneously and independently by varying convenience samples of three nurses per assessment situation. Pressure ulcer risk was measured by the Glamorgan scale and a 100 mm Visual Analogue Scale (VAS). Proportions of agreement (po), multirater kappa and intraclass correlation coefficients were calculated. Thirty children were rated by 27 nurses. Median children's age was 5·5 years. Agreement among item scores was high, whereas reliability coefficients of item scores were low. Interrater reliability for the Glamorgan scale sum scores was higher than for VAS scores. Correlation between both scales was moderate. High agreement among item scores indicates that nurses are able to make precise judgements. The low interrater reliability of item and sum scores indicates that nurses were unable to differentiate the rated children based on their item and sum scores, thus providing little additional clinically relevant information about pressure ulcer risk in this setting. The Glamorgan scale and the VAS are unable to make clear distinctions in a low-risk setting. Therefore, it is unlikely that the tools in this setting provide additional information for clinical decision making. Neither tool is recommended for daily use.
- Research Article
1
- 10.21315/aos2023.1802.oa03
- Dec 20, 2023
- Archives of Orofacial Sciences
The Dental Practicality Index (DPI) and the American Association of Endodontists Endodontic Case Difficulty Assessment (AAECDA) form can potentially guide clinicians in making clinical decisions and triaging in large practices and academic settings. Nonetheless, their reliability and validity should be evaluated before institution-wide implementation. This study aimed to evaluate the inter-rater reliability of the DPI and AAECDA forms. Ten randomly selected, trained students rated 25 cases with both forms. The item-by-item inter-rater and overall reliability were estimated with Gwet’s agreement coefficient (AC2) and intraclass correlation coefficient (ICC), respectively. The association between clinical decisions and the scores was analysed with the Generalised Estimating Equation. The inter-rater reliability of DPI was generally very good (AC2 = 0.81–1.00), except context (good; AC2 = 0.718; 95% confidence interval [CI] = 0.575–0.861). The inter-rater reliability of AAECDA was generally very good (AC2 = 0.81–1.00) and good (AC2 = 0.61–0.80), except the radiographic appearance of the canal(s) (fair; AC2 = 0.424, 95% CI = 0.263–0.585). Moderate overall inter-rater reliability of AAECDA (ICC = 0.53, 95% CI = 0.38–0.70) and DPI (ICC = 0.62, 95% CI = 0.48–0.77) was observed. Referral to an endodontist was positively associated with AAECDA score (odds ratio [OR] = 1.323, 95% CI = 1.145–1.52, p < 0.001). The decision of tooth extraction was positively associated with the DPI score (OR = 1.983, 95% CI = 1.539–2.555; p < 0.001). In conclusion, DPI and AAECDA are methods with moderate inter-rater reliability when used among dental students.