Articles published on Interrater Reliability
39276 Search results
- New
- Research Article
- 10.1016/j.jad.2025.121095
- Apr 1, 2026
- Journal of affective disorders
- Yang S Liu + 17 more
Doctors can agree: Enhancing interrater reliability of mental health diagnosis among junior psychiatrists using electronic clinician assisting technology.
- New
- Research Article
- 10.1016/j.jamda.2025.106105
- Apr 1, 2026
- Journal of the American Medical Directors Association
- Peiyuan Zhang + 5 more
Psychometric Analysis of an Advance Care Planning Implementation Quality Assessment Tool (ACP-QAT) for Nursing Homes.
- New
- Research Article
- 10.1016/j.injury.2026.113115
- Apr 1, 2026
- Injury
- Vanessa Morello + 8 more
High-energy pelvic ring injuries: Are standard anteroposterior x-rays still relevant in the CT era?
- New
- Research Article
- 10.1016/j.jor.2026.01.013
- Apr 1, 2026
- Journal of orthopaedics
- Maximilian Appel + 2 more
This study aims to compare the interobserver reliability of the traditional two-dimensional AO/OTA classification with the more recent three-dimensional Luo three-column classification for tibial plateau fractures. Furthermore, it evaluates the impact of both systems on the surgical approach selection, particularly examining Luo et al.'s hypothesis that the three-column classification encourages increased consideration of the posterior column during preoperative planning. However, this hypothesis has not been evaluated yet, leaving a research gap regarding its influence in practice on surgical approach selection. Fifteen cases of tibial plateau fractures were retrospectively analyzed by nine trauma surgeons using radiographs and CT scans. Fractures were classified according to the AO/OTA and Luo classifications, and preferred surgical approaches for definitive fixation were determined. Interobserver reliability was assessed using Fleiss' kappa and interpreted according to the categorical rating by Landis and Koch. Additionally, a chi-square test was performed to evaluate statistical significance in the surgical approach selection. Both classification systems showed overall substantial reliability (kAO=0.63; kLuo=0.67). The difference in agreement for surgical approach groups between the two classifications was 0.11 (kAO_approach=0.37; kLuo_approach=0.48). The posterior approach group was not selected significantly more often using the Luo three-column classification compared to the AO/OTA classification (p=0.543). No significant difference in interobserver reliability or in the choice of surgical approach was observed between the AO/OTA and Luo classifications.
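The agreement statistic named in this abstract can be sketched as follows. This is a minimal illustration of Fleiss' kappa with the Landis and Koch labels, not the authors' analysis code; the function names `fleiss_kappa` and `landis_koch` and the input layout (one row per case, one column per category, cells holding rater counts) are assumptions made for the example.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a subjects x categories table of rater counts.

    counts[i][j] is the number of raters who put subject i in category j.
    Every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement for each subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects      # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch(kappa: float) -> str:
    """Verbal label for a kappa value after Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under these labels, the reported classification values (0.63 and 0.67) fall in the "substantial" band, while the approach-selection values (0.37 and 0.48) fall in "fair" and "moderate" respectively.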
- New
- Research Article
- 10.1016/j.nedt.2025.106968
- Apr 1, 2026
- Nurse education today
- Pao-Ju Chen
AI-enhanced virtual reality simulation for nursing students' empathy: Automated scoring and inter-rater reliability in a randomised controlled study.
- New
- Research Article
- 10.1016/j.micpath.2026.108376
- Apr 1, 2026
- Microbial pathogenesis
- Veena Mishra + 4 more
Toxoplasma gondii in slaughtered sheep: a study of parasite prevalence, isolation, genotyping, virulence, and potential health risks to butchers in India.
- New
- Research Article
- 10.1016/j.acap.2025.103179
- Apr 1, 2026
- Academic pediatrics
- Rebecca Valek + 6 more
Addressing Risks of Violence to Children and Adolescents Through Oregon's Extreme Risk Protection Order Law.
- New
- Research Article
- 10.1016/j.fas.2025.10.008
- Apr 1, 2026
- Foot and ankle surgery : official journal of the European Society of Foot and Ankle Surgeons
- Julien Paquot + 2 more
MRI assessment of graft maturation after arthroscopic anatomical lateral ankle ligament reconstruction: One-year comparison between autograft and allograft.
- New
- Research Article
- 10.1111/aas.70221
- Apr 1, 2026
- Acta anaesthesiologica Scandinavica
- Luan Bicalho Costa + 5 more
The American Society of Anesthesiologists Physical Status (ASA-PS) classification system is ubiquitous in perioperative medicine and research as a tool for preoperative patient risk stratification. Despite widespread clinical adoption as a predictor of perioperative outcomes, the ASA-PS system is inherently subjective, leading to considerable inter-rater variability. A comprehensive mapping of the literature examining the relationship between ASA-PS scores and patient outcomes is lacking. The objective is to systematically map the extent, range, and nature of peer-reviewed literature examining the relationship between the ASA-PS classification and patient outcomes, and to identify key characteristics, themes, and knowledge gaps in this evidence base. This scoping review will be conducted according to the Joanna Briggs Institute (JBI) methodological framework and reported using the Preferred Reporting Items for Systematic Review and Meta-Analysis extension for Scoping Reviews (PRISMA-ScR). The Population-Concept-Context (PCC) framework will guide eligibility assessment. A comprehensive search will be conducted across PubMed, EMBASE, Scopus, LILACS, and the Cochrane Central Register of Controlled Trials, with no language or date restrictions. Study selection will be performed independently and in duplicate by two reviewers in two stages (title/abstract screening, full-text review). If any discordance arises, a third reviewer's verdict will be requested. Data will be extracted using a structured charting form and synthesized narratively. The eligible context is any healthcare setting where an ASA-PS score is assigned prior to a procedure (inpatient hospital, ambulatory surgery center, outpatient clinic). Primary research designs, including randomized controlled trials, observational studies (cohort, case-control, cross-sectional, descriptive), and case reports will be eligible; review articles, editorials, letters to the editor, and commentaries will be excluded.
The search will employ controlled vocabulary (MeSH terms) and free-text keywords including: "ASA score," "ASA Physical Status Classification System," "American Society of Anesthesiologists," in combination with outcome-related terms. Supplementary hand searching of reference lists and Google Scholar will be performed. Study characteristics (author, year, country, journal, design), population characteristics (sample size, age, comorbidity), context (clinical setting, specialty, procedure type, urgency), ASA score details, and outcome details (including statistical methods used to derive associations) will be extracted. A preliminary data charting form is provided in Appendix B. Narrative synthesis supported by descriptive statistics will map study characteristics, outcome categories, clinical contexts, study designs, and temporal and geographical distribution of research. No formal quality appraisal will be conducted. Ethics committee approval is not required for this protocol-based scoping review.
- New
- Research Article
- 10.1016/j.mri.2025.110578
- Apr 1, 2026
- Magnetic resonance imaging
- Kevin Sun Zhang + 13 more
To assess the variability of maximum diameter measurements of prostate lesions on MRI with respect to patient repositioning, rater, and sequence effects. Forty-two patients who received a clinical bi-/multiparametric prostate MRI examination and agreed to have the T2-weighted (T2WI) and diffusion-weighted imaging (DWI) sequences scanned twice were included retrospectively. Maximum diameter measurements of prostate lesions mentioned in the clinical radiologist reports were performed by four readers in multiple reading sessions to determine inter-sequence (between two DWI sequences), inter-scan (between clinical and additional scan), intra-rater, and inter-rater variability. The primary calculated metrics were the repeatability and reproducibility coefficients (RC/RDC), including pooled RC/RDC. Variability measured by RCs/RDCs was lowest for measurements obtained within the same reading session, with inter-scan RCs up to 5.6 mm/6.5 mm for T2WI/DWI, pooled RCs of 4.8 mm/5.8 mm, respectively, and inter-sequence RDCs of 5.4 mm-5.9 mm, pooled RDC 5.8 mm. Measurements performed in separate reading sessions demonstrated significantly higher variability for both settings in the majority of cases (RCs: up to 10.9 mm/11.7 mm/10.2 mm for T2WI/DWI/inter-sequence, p≤0.002), pooled RCs/RDCs 9.2 mm-9.9 mm. Measurements necessarily generated in different reading sessions, i.e., intra-rater or inter-rater, demonstrated high variability (RCs/RDCs up to 11.4 mm/11.5 mm for T2WI/DWI). Prostate lesion measurements demonstrate considerable variability. When measured in one reading session by one rater, lesion diameter differences below the pooled RCs of 4.8 mm, 95%-CI [3.9, 5.6] for T2WI and 5.8 mm, 95%-CI [4.7, 7.1] for DWI should not necessarily be assumed to represent true biological change, as these differences may result from measurement- or repositioning-based variability alone. Caution is needed when assessing size changes.
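The repeatability coefficient used in this abstract can be illustrated with a short sketch. The abstract does not state which estimator was used, so the code below assumes the common Bland-Altman style definition for duplicate measurements, RC = 1.96 × √2 × within-subject SD; the function names and the helper `exceeds_rc` are hypothetical.

```python
import math
from typing import Sequence


def repeatability_coefficient(first: Sequence[float],
                              second: Sequence[float]) -> float:
    """RC from paired repeat measurements (assumed estimator, see lead-in).

    The within-subject variance of duplicates is sum(d^2) / (2n), where d is
    the difference between the two measurements of the same lesion.
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("need two equally long, non-empty measurement lists")
    n = len(first)
    within_var = sum((a - b) ** 2 for a, b in zip(first, second)) / (2 * n)
    # 1.96 * sqrt(2) * within-subject SD
    return 1.96 * math.sqrt(2.0 * within_var)


def exceeds_rc(change_mm: float, rc_mm: float) -> bool:
    """True if an observed change is larger than measurement variability alone
    would explain at the 95% level."""
    return abs(change_mm) > rc_mm
```

With the pooled T2WI value from the abstract, `exceeds_rc(4.0, 4.8)` is false, which matches the authors' caution that a 4 mm difference cannot be separated from measurement or repositioning variability.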
- New
- Research Article
- 10.1016/j.clnesp.2026.102917
- Apr 1, 2026
- Clinical nutrition ESPEN
- Lana M Agraib + 3 more
Comparing the eating attitudes test (EAT-26) and disorder examination questionnaire (EDE-Q) as screening tools for eating disorders among young adults: A population-specific analysis.
- New
- Research Article
- 10.1016/j.gerinurse.2026.103855
- Apr 1, 2026
- Geriatric nursing (New York, N.Y.)
- Alex Chanteclair + 6 more
Transcultural adaptation of a French version of the quality of life in late-stage dementia (QUALID) scale for older adults with severe cognitive impairment: A preliminary study and research perspectives.
- New
- Research Article
- 10.1002/pan.70112
- Apr 1, 2026
- Paediatric anaesthesia
- Lucy Liu + 4 more
The American Society of Anesthesiologists Physical Status (ASA-PS) classification system is widely used to classify patient comorbidities prior to surgery and is often used as a marker of perioperative risk. Since its inception in 1941, it has undergone modifications to adapt to changing clinical needs and to improve its reliability. In 2020, a version of the ASA-PS was released with pediatric-specific case examples. To explore inter-rater reliability in ASA-PS scoring in the pediatric population. This single-center retrospective study evaluated the assigned ASA-PS scores of 364 patients at a quaternary pediatric hospital. Each patient was assigned three ASA-PS scores: one by the case anesthetist and one each by two independent consultant anesthetists using the ASA guidance issued in 2020. Concordance was measured between the assigned scores, and potential reasons for discordant scores were identified. There was strong concordance of ASA-PS scores between the two independently scoring anesthetists (weighted kappa coefficient 0.76), but only moderate concordance between the case anesthetist and the independent anesthetists (weighted kappa coefficient 0.5). Where there was a discrepancy, the case anesthetist had usually underscored the ASA-PS by 1 point. Patients who had symptomatic cardiac disease, abnormal body mass index for age, an oncologic state, brain malformation, or a difficult airway were more likely to be assigned an incorrect ASA-PS score. Moderate inter-rater variability exists in the assignment of ASA-PS scores in the pediatric population, and many patients are being underscored. Use of ASA guidance to assist with pediatric ASA-PS scoring improves the reliability of scoring and may improve accurate communication of perioperative risk.
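A weighted kappa of the kind reported here can be sketched as below. The abstract does not say whether linear or quadratic weights were used, so the `weighting` parameter, the function name, and the example scores are assumptions for illustration only. The point of weighting is that a one-class disagreement on an ordinal scale such as ASA-PS is penalised less than a two-class disagreement.

```python
from typing import Hashable, List, Sequence


def weighted_kappa(rater_a: Sequence[Hashable],
                   rater_b: Sequence[Hashable],
                   categories: List[Hashable],
                   weighting: str = "linear") -> float:
    """Weighted Cohen's kappa for two raters on an ordered category list."""
    k = len(categories)
    n = len(rater_a)
    if k < 2 or n == 0 or n != len(rater_b):
        raise ValueError("need >= 2 categories and two equally long ratings")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    index = {c: i for i, c in enumerate(categories)}

    # Joint proportion table of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i: int, j: int) -> float:
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    num = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    den = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if den == 0.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - num / den
```

For example, with hypothetical scores `[1, 2, 3, 1]` from one rater, a second rater who answers `[1, 2, 3, 2]` (off by one class on the last case) obtains a higher linear-weighted kappa than one who answers `[1, 2, 3, 3]` (off by two classes).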
- New
- Research Article
- 10.1212/wnl.0000000000214615
- Mar 24, 2026
- Neurology
- Laura Tufano + 7 more
We previously demonstrated the feasibility of remote assessments in individuals with myotonic dystrophy type 1 (DM1). This study aimed to evaluate test-retest reliability and agreement of remote assessments and the interrater reliability of video-recorded functional assessments in DM1. Participants were remotely recruited from the National Registry and provided with a toolkit containing a tablet equipped with videoconferencing software and devices for strength and functional assessments. Two remote study visits (RSV1, RSV2) were conducted within 3 months. During each visit, participants completed video-supervised assessments: handgrip, pinch grip (PG), 9-hole peg test (9HPT), video hand opening time (vHOT), timed up and go (TUG), 10-meter walk/run test (10MWRT), sitting and supine forced vital capacity (FVC), sniff nasal inspiratory pressure (SNIP), and tongue and buccal strength tests. Timed tests were video-recorded and scored using standardized protocols. Intraclass correlation coefficients (ICCs) were calculated using 2-way mixed-effects model for test-retest reliability and 2-way random-effects model for interrater reliability. Agreement was evaluated using Bland-Altman plots and measurement sensitivity with minimal detectable differences at 95% confidence (MDD95%). Patient-reported outcomes (PROs) assessing dysphagia (Eating Assessment Tool-10 [EAT-10]) and upper and lower extremity function (Upper Extremity Function Index [UEFI], Lower Extremity Functional Scale [LEFS]) were collected at RSV1 and correlated with quantitative assessments (Spearman coefficient, ρ). Forty individuals with DM1 (average age 47, 55% female) completed both RSVs. Test-retest reliability (ICC, 95% CI) was excellent for handgrip (0.99, 0.98-0.99), sitting and supine FVC (0.98, 0.96-0.99), 10MWRT (0.96, 0.91-0.98), TUG (0.94, 0.89-0.97), and 9HPT (0.93, 0.86-0.97) with adequate measurement sensitivity (MDD95% <30% for all). 
ICCs were acceptable for PG (0.96, 0.92-0.98), tongue strength (0.93, 0.86-0.96), right (0.93, 0.85-0.97) and left buccal strength (0.88, 0.75-0.94), and SNIP (0.88, 0.77-0.94) and moderate for vHOT (thumb 0.67, 0.42-0.83; middle finger 0.66, 0.40-0.83), but with inadequate measurement sensitivity (MDD95% >30% for all). Interrater reliability (ICC, 95% CI) was excellent for 9HPT (1), vHOT (thumb 0.99, 0.98-0.99; middle finger 0.98, 0.97-0.99), 10MWRT, and TUG (both 0.99, 0.98-0.99). Bland-Altman plots showed no systematic bias. Correlation (ρ, |95% CI|) was strong between handgrip and UEFI (+0.70, 0.49-0.84), 10MWRT and LEFS (-0.72, 0.86-0.49), and moderate between PG and UEFI (+0.60, 0.30-0.82), TUG and LEFS (-0.53, 0.73-0.23), and tongue strength and EAT-10 (-0.49, 0.71-0.20). Remote assessments are feasible and safe. Many measurements demonstrate high reliability, show adequate measurement sensitivity, and correlate with PROs. These findings support remote research, facilitating broad participation, and reducing participant burden.
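The link between an ICC and the MDD95% threshold used in this abstract can be made concrete with a small sketch. The abstract does not give its formula, so the code assumes the conventional one: SEM = SD × √(1 − ICC) and MDD95 = 1.96 × √2 × SEM, expressed as a percentage of the sample mean. The function names and the numbers in the usage note are hypothetical and are not taken from the study.

```python
import math


def mdd95(sd: float, icc: float) -> float:
    """Minimal detectable difference at 95% confidence (assumed formula).

    sd  : between-subject standard deviation of the measurement
    icc : test-retest intraclass correlation coefficient
    """
    if sd < 0 or icc > 1:
        raise ValueError("sd must be non-negative and icc at most 1")
    sem = sd * math.sqrt(1.0 - icc)        # standard error of measurement
    return 1.96 * math.sqrt(2.0) * sem


def mdd95_percent(sd: float, icc: float, mean: float) -> float:
    """MDD95 expressed as a percentage of the sample mean."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return 100.0 * mdd95(sd, icc) / mean
```

As a usage illustration with made-up inputs, `mdd95_percent(10.0, 0.75, 50.0)` gives roughly 27.7%, which would sit just under a 30% cut-off. The sketch also shows why a high ICC alone does not guarantee adequate sensitivity: a large SD relative to the mean can still push MDD95% above the threshold.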
- Research Article
- 10.1111/joor.70177
- Mar 13, 2026
- Journal of oral rehabilitation
- Pierre Cnockaert + 3 more
Tongue function is critical for essential activities such as feeding and sleep. In particular, tongue protrusion plays a key role in maintaining airway patency during sleep. However, standardised protocols for measuring tongue protrusion are lacking, and the reliability of tongue function assessments in children remains underexplored. To assess the intra- and inter-rater reliability of a novel tongue protrusion measurement method using a 3D-printed stand, alongside established tongue measurements. Fifty-six children aged 4 to 17 years underwent two successive visits with the same rater and a third visit with a different rater, each spaced 1-4 weeks apart. Tongue protrusion strength (pProt) and endurance (eProt), elevation strength (pElev) and endurance (eElev), and tongue pressure during swallowing (pSwal) were assessed using the Iowa Oral Performance Instrument (IOPI). An adjustable 3D-printed stand was developed and used to standardise the protrusion task. Tongue mobility was assessed using the Motricité Bucco-Linguo-Faciale tongue subscore (MBLF-t), and mobility restriction due to the frenulum was measured with the Tongue Range of Motion Ratio (TRMR). Reliability was analysed using intraclass correlation coefficients (ICCs). pProt showed good intra-rater reliability and moderate inter-rater reliability, while eProt demonstrated excellent intra-rater and good inter-rater reliability. All other outcomes exhibited at least good intra- and inter-rater reliability, except for pElev, which showed slightly lower inter-rater reliability (ICC = 0.73). This study highlights the reliability of tongue function assessments, including a novel IOPI-based protrusion measurement method, supporting their use in future research and clinical practice. ClinicalTrials.gov ID: NCT06166680.
- Research Article
- 10.1177/10556656261422855
- Mar 13, 2026
- The Cleft palate-craniofacial journal : official publication of the American Cleft Palate-Craniofacial Association
- Angelica Pistoia + 7 more
Objective: To compare long-term esthetic and morphological outcomes of unilateral cleft lip (UCL) repair using the Millard rotation-advancement versus the Tennison-Randall triangular flap technique, testing whether Millard provides superior lip symmetry. Design: Retrospective cohort study. Setting: Single tertiary cleft center. Patients, Participants: Forty adults aged 18-25 were selected from 168 patients treated between 1988 and 1995. Inclusion criteria: complete UCL, primary cheiloplasty, absence of secondary lip revisions, complete frontal photographic documentation, and no syndromic diagnoses. Twenty underwent Millard repair and twenty Tennison-Randall repair. Interventions: Primary UCL reconstruction performed using either Millard rotation-advancement or Tennison-Randall triangular flap. All operations were carried out by the same senior cleft surgeon under comparable operative conditions. Main Outcome Measure(s): Long-term lip symmetry quantified through a Symmetry Index derived from predefined anthropometric landmarks on standardized images. Subjective esthetic satisfaction assessed using the Esthetic Units Satisfaction Questionnaire and the Cleft Esthetic Rating Scale, completed by patients, the operating surgeon, and a blinded observer. Results: Millard repair showed significantly greater medial lip width symmetry (p = .014). No significant differences were found for vermilion height, prolabial height, lateral lip width, or lip area. Subjective assessments consistently favored Millard, showing higher satisfaction and fewer negative ratings. Inter-rater reliability across evaluators was high (ICC = 0.82). Conclusions: Both techniques produced stable long-term outcomes, but Millard yielded superior medial lip symmetry and higher esthetic satisfaction. These findings support its continued clinical preference and highlight the importance of long-term evaluations. Larger prospective studies are needed to confirm these results.
- Research Article
- 10.1177/14034948261423410
- Mar 13, 2026
- Scandinavian journal of public health
- Rachel C Davis + 5 more
Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). There are several AI options to assist with title/abstract screening; however, options for full-text screening are limited. The objective of this study was to evaluate the performance of a custom generative pretrained transformer (cGPT) for full-text screening. A cGPT powered by OpenAI's ChatGPT4o was tested with subsets of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate cGPT as an autonomous and an assistant reviewer. Cohen's kappa was used to assess interrater agreement. For the inclusion/exclusion decision, the human-human kappa scores ranged from 0.87 to 0.96, exceeding the ranges of kappa scores for autonomous cGPT-human pairings (0.59 to 0.67) and assistant cGPT-human pairings (0.62 to 0.72). For exclusion reason classification, the human-human kappa scores ranged from 0.71 to 0.78, exceeding the ranges of kappa scores for autonomous cGPT-human pairings (0.47 to 0.53) and assistant cGPT-human pairings (0.52 to 0.63). The assistant cGPT outperformed the autonomous cGPT. An assistant cGPT could speed up systematic reviewing in a sufficiently reliable manner; however, further research is needed to establish standardized thresholds for practical use. Improved speed of systematic reviewing has implications for directing timely public health policy decisions.
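The pairwise agreement statistic in this abstract, Cohen's kappa, compares observed agreement between two reviewers with the agreement expected by chance from each reviewer's label frequencies. The sketch below is a minimal illustration; the function name and the include/exclude labels are hypothetical and do not come from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(reviewer_a: Sequence[Hashable],
                 reviewer_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two reviewers' categorical decisions."""
    n = len(reviewer_a)
    if n == 0 or n != len(reviewer_b):
        raise ValueError("need two equally long, non-empty decision lists")

    # Observed agreement: share of items with identical decisions.
    p_o = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n

    # Chance agreement: product of the reviewers' marginal label shares.
    count_a, count_b = Counter(reviewer_a), Counter(reviewer_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when both reviewers use one label")
    return (p_o - p_e) / (1.0 - p_e)
```

For instance, if a human reviewer records `["include", "include", "exclude", "exclude"]` and a second reviewer records `["include", "exclude", "exclude", "exclude"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.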
- Research Article
- 10.3174/ajnr.a9021
- Mar 12, 2026
- AJNR. American journal of neuroradiology
- Carmen R Cerron-Vela + 5 more
The sphenoid bone forms from multiple ossification centers. Its body develops through the fusion of presphenoid and postsphenoid cartilages separated by the intersphenoid synchondrosis. Variations in ossification can lead to persistent craniopharyngeal duct remnants, potentially associated with pituitary dysfunction or tumors. We aimed to determine the timeline of closure of these synchondroses and associated foramina in children without skull base abnormalities on CT scans. This retrospective study analyzed CT scans of children aged 0-6 years from a tertiary pediatric hospital (2018-2022). Scans with abnormalities or skull anomalies were excluded. Two pediatric radiologists assessed synchondroses and foramina, classifying them as patent or fused. Sample size was determined using area under the curve (AUC) analysis. Statistical methods included descriptive analysis, interrater reliability (Cohen κ, intraclass correlation coefficient), Mann-Whitney U test, and cut-point analysis with bootstrapping to determine closure times. We analyzed 160 scans (94 boys, 58.8%; 66 girls, 41.2%) with a median age of 1.4 years (interquartile range: 0.3-3.7). Interrater reliability was strong (κ > 0.80) for most structures, moderate for detecting intrapresphenoid synchondrosis and pneumatization, and weak for intrapostsphenoid synchondrosis. Cut-point analysis demonstrated that the intersphenoid synchondrosis fused first at 4 months, followed by the intrapresphenoid synchondrosis, the anterior and posterior foramen, with pneumatization occurring last at 24.8 months; all with an AUC >80%. Pair-wise threshold differentiation showed pneumatization followed the closure of intersphenoid synchondrosis, intrapresphenoid synchondrosis, and anterior foramen by 22.8, 22.7, and 17.4 weeks, respectively. The sphenoid body synchondroses and foramina show a predictable closure timeline within the first year of life, while pneumatization commences after the second year. 
Understanding this timeline provides radiologists with a reference standard for interpreting CT examinations that include the skull base (eg, head, maxillofacial, temporal bone CTs) in children younger than 2 years of age, supporting more confident interpretation and potentially reducing overcalling and related follow-up imaging.
- Research Article
- 10.1044/2025_jslhr-24-00713
- Mar 12, 2026
- Journal of speech, language, and hearing research : JSLHR
- Amélie Brisebois + 3 more
Lexical performance in discourse is of considerable interest in acquired communication disorders. The transcription-free core lexicon measure evaluates the most typical words a person uses during communication. This study aimed (a) to develop core lexicon lists in Laurentian French speakers without brain injury and (b) to assess their psychometric properties. Spoken discourse was elicited using the picture description task from the Western Aphasia Battery-Revised (WAB-R; Kertesz, 2006) and the Cinderella Story Telling (CST) task. Participants were Laurentian French speakers from Quebec, aged 50-79 years, without brain injury. Sixty-six completed the WAB-R task, and 48 completed the CST task. Core noun and verb lists were created using the CLAN program, including words produced by at least 50% of the sample. Two raters scored all audio samples. Intra- and interrater reliability and long-term test-retest reliability were calculated. Construct validity was examined through correlations with micro- and macrostructural discourse measures. Four core lexicon lists were generated. For the WAB-R, 19 nouns and five verbs were identified; for the CST, 19 nouns and 16 verbs were identified. Intrarater reliability was excellent across variables, and interrater reliability was excellent for all core noun lists and CST core verbs and good for WAB-R core verbs. Long-term test-retest reliability ranged from poor to moderate across measures. Core lexicon scores were significantly and positively correlated with 12 macrostructural and nine microstructural variables. This study supports the rater reliability and construct validity of core lexicon measures in Laurentian French speakers across two discourse tasks. It also provides the first long-term test-retest reliability data for core lexicon scoring, offering insights that guide its clinical and research applications. https://doi.org/10.23641/asha.31236010.
- Research Article
- 10.3174/ajnr.a9294
- Mar 12, 2026
- AJNR. American journal of neuroradiology
- Anass Benomar + 14 more
Contrast-enhanced 3D T1-weighted MRI is the imaging reference for detection and follow-up of brain metastases. Volumetric GRE-based sequences, such as MPRAGE, are widely used but remain prone to susceptibility and lower lesion conspicuity. 3D black-blood TSE-based sequences, such as Sampling Perfection with Application-Optimized Contrasts by using different flip angle Evolutions (SPACE), have been increasingly embedded into routine workflow and are thought to improve lesion detection in part through vessel signal suppression. We aimed to investigate the comparative diagnostic performance of 3D T1 TSE versus GRE sequences for the detection of brain metastases. Studies comparing the diagnostic performance of postcontrast 3D T1 TSE and GRE sequences in adults with brain metastases were searched on MEDLINE, EMBASE, Cochrane Central, Google Scholar, and PROSPERO, from inception through April 2025. Fifteen studies encompassing 544 patients with 4338 metastases were included. Data on diagnostic accuracy parameters, image quality, and inter-rater agreement were extracted. Random-effects models were applied to compute pooled sensitivity and comparative OR for lesion detection. Risk of bias was assessed using QUADAS-2 and QUADAS-C tools. Pooled sensitivities for detection of brain metastases were 97.4% (95%CI, 93.2%-99.0%) for TSE and 76.1% (95%CI, 69.3-81.9) for GRE-based sequences, with a comparative OR of 12.0 (95%CI, 5.45-26.6, P <.0001). Detectability of small lesions (<5 mm) was significantly better on TSE (96.1%; 95%CI, 87.7-98.8) than GRE (58.4%; 95%CI, 47.9-68.2), while both techniques performed comparably for larger (≥5 mm) lesions (98.2% for TSE, 94.4% for GRE). OR estimates were 17.2 (95%CI, 4.50-66.1) for small and 2.81 (95%CI, 0.92-8.56) for large lesions. Contrast-to-noise-ratio and inter-rater agreements were slightly higher on TSE than GRE.
False positives were more common with TSE, mostly related to incomplete vessel suppression (49 FP counts in TSE, 35 in GRE). Our meta-analysis is limited by high heterogeneity, case-only studies, possible small-study effects, and high risk of bias for the reference standard domain. Postcontrast 3D T1 TSE sequences provide higher sensitivity and improved lesion conspicuity compared with GRE sequence, particularly for small metastases, though at the cost of slightly higher false positives.