Articles published on Inter-rater Reliability
40895 Search results
- New
- Research Article
- 10.1016/j.jbmt.2025.10.054
- Jun 1, 2026
- Journal of bodywork and movement therapies
- Bergdal L + 3 more
The tibialis posterior muscle has an important role both in stabilizing the foot and in inversion, plantar flexion, and adduction of the foot. Impaired function can lead to tibialis posterior dysfunction. A clinical test that can objectively measure tibialis posterior strength is warranted. The aim of this study was to investigate the interrater, test-retest, and intersession reliability of a test designed to measure tibialis posterior strength with a hand-held dynamometer. The design assessed interrater, between-day test-retest, and intersession reliability in a university laboratory. The participants comprised 20 healthy individuals (mean age 28.8 years, n=10 women) without foot problems. A test was designed to measure tibialis posterior strength with a hand-held dynamometer (HHD). The test was performed on two occasions 5-15 days apart and was carried out by two raters. The intraclass correlation coefficient (ICC), 95% confidence interval, standard error of measurement (SEM), and minimal detectable change were calculated. Interrater reliability was good on both occasions (ICC: 0.769, 0.794), test-retest reliability was moderate for both raters (ICC: 0.671, 0.672), and intersession reliability was excellent (ICC: 0.934-0.967). However, the confidence intervals varied widely (-0.027 to 0.986) and the SEM was relatively high (2.356-3.863 N). This test seems to be reliable, but has some limitations. The results suggest that the current version of the test could be used to compare strength between feet, but that further development of the test is needed to achieve increased interrater and test-retest reliability.
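The abstract above reports SEM and minimal detectable change alongside the ICC. As a minimal sketch of how those quantities are commonly related, assuming the textbook formulas SEM = SD × sqrt(1 − ICC) and MDC95 = 1.96 × sqrt(2) × SEM (the abstract does not state which formulas the authors used), the function names and the SD value below are hypothetical:

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC), with SD the between-subject standard deviation."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("this formula expects an ICC between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC at the confidence level implied by z (1.96 for 95%)."""
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs: the SD of 10.0 N and ICC of 0.75 are illustrative,
# not values taken from the study.
sem = standard_error_of_measurement(sd=10.0, icc=0.75)
mdc = minimal_detectable_change(sem)
print(f"SEM = {sem:.2f} N, MDC95 = {mdc:.2f} N")
```

Under these formulas, a lower ICC or a wider spread between subjects both inflate the SEM, which in turn raises the smallest change that can be distinguished from measurement noise.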
- New
- Research Article
- 10.1007/s00261-025-05256-5
- Jun 1, 2026
- Abdominal radiology (New York)
- Michael Phillipi + 14 more
Approximately 20-50% of patients develop biochemical recurrence (BCR) of prostate cancer within 10 years following radical prostatectomy (RP). The accurate identification of recurrent disease is crucial for guiding salvage treatment decisions. While multiparametric MRI (mpMRI) and prostate-specific membrane antigen positron emission tomography/computed tomography (PSMA PET/CT) are both utilized for detecting local recurrence, their combined diagnostic benefits remain unclear. This study seeks to evaluate the diagnostic performance of both modalities alone and in conjunction for detecting local recurrence following RP in patients with BCR. A retrospective single-institution analysis included 37 post-RP patients with BCR who received mpMRI and PSMA PET/CT. Five board-certified radiologists reviewed images in three phases: mpMRI only, PSMA PET/CT only, and both modalities combined. Multidisciplinary tumor board consensus served as the reference standard. Diagnostic performance, inter-reader agreement, and radiologist confidence with each modality were examined. MpMRI outperformed PSMA PET/CT, yielding a higher sensitivity (73.0% vs. 65.2%) and specificity (77.1% vs. 75.7%). Interpretation of mpMRI and PSMA PET/CT together achieved the highest diagnostic accuracy (77.8%), representing a statistically significant increase over PSMA PET/CT (p = 0.026) but a non-significant increase over mpMRI (p = 0.441). Combined imaging also resulted in greater specificity (90.0%) and inter-rater reliability (κ = 0.622). However, in some cases performance decreased with both modalities due to interpretive pitfalls. While mpMRI remains the preferred imaging modality for post-RP local recurrence surveillance, the integration of PSMA PET/CT may lead to improved specificity and inter-rater reliability. However, radiologists must understand each modality's limitations to avoid interpretive pitfalls.
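The abstract above reports a κ value for agreement among five readers but does not say which kappa variant was used. As one possibility for a multi-reader design, Fleiss' kappa can be sketched as follows; the function name and the small count table are hypothetical and are not data from the study:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects x categories table of rating counts.

    Every row must sum to the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # Per-subject agreement: share of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical table: 4 patients, 5 readers, columns = (recurrence, no recurrence).
example = [[5, 0], [4, 1], [1, 4], [0, 5]]
kappa = fleiss_kappa(example)
print(f"Fleiss' kappa = {kappa:.3f}")
```

The statistic compares the observed share of agreeing reader pairs with the share expected if readers assigned categories at the overall base rates.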
- New
- Research Article
- 10.1016/j.jbmt.2025.10.038
- Jun 1, 2026
- Journal of bodywork and movement therapies
- Kira Eimiller + 4 more
Reliability of the modified Thomas test in those with low back pain.
- New
- Research Article
- 10.1016/j.ddj.2025.100043
- Jun 1, 2026
- Digital Dentistry Journal
- Maite Aretxabaleta + 9 more
Retrospective validation of a novel digital methodology for evaluating intraoral maxillary scans in newborns with craniofacial disorders
- New
- Research Article
- 10.1016/j.artd.2026.101992
- Jun 1, 2026
- Arthroplasty today
- Rapeepat Narkbunnam + 5 more
Comparison of the Reliability and Gap Differences Between Using the Varus-Valgus Stress Technique and Paddles Technique in Pre-Resection Balancing Phase of Robotic-Assisted Total Knee Arthroplasty.
- New
- Research Article
- 10.1016/j.jemermed.2026.02.042
- Jun 1, 2026
- The Journal of emergency medicine
- Patricia Hernández + 7 more
Evaluating the Performance of Large Language Models for Generating Emergency Department Discharge Instructions.
- New
- Research Article
- 10.1016/j.chiabu.2026.108025
- Jun 1, 2026
- Child abuse & neglect
- Paulo Correia Silva + 2 more
Risk and safety of children and youth in child protection systems: A systematic review of risk assessment instruments.
- New
- Research Article
- 10.1016/j.jclinepi.2026.112226
- Jun 1, 2026
- Journal of clinical epidemiology
- Pier Carlo Battain + 3 more
An exploratory descriptive survey on the use of GRADE and CINeMA: time-consuming, process transparency and subjectivity versus high-speed, practical challenges and poor understanding.
- New
- Research Article
- 10.1002/puh2.70222
- Jun 1, 2026
- Public health challenges
- Benjamin Abaidoo + 2 more
Despite the availability of various methods for assessing medication adherence, limited guidance exists regarding the most appropriate tool, particularly for measuring glaucoma medication adherence. This study aimed to achieve expert consensus on the appropriate tool for measuring glaucoma medication adherence using the Delphi technique. A two-round Delphi study was conducted with a panel of experts from diverse fields, assessing three validated adherence measurement tools. Consensus was determined using Kendall's Coefficient of Concordance. The extent of agreement and inter-rater reliability were evaluated using the scale-level content validity index (SCVI) and intraclass correlation coefficients (ICC), analysed in SPSS version 25. Sixteen experts (mean age 53.8±7.1 years; mean professional experience: 21.9±6.8 years) participated. Consensus levels of 81.0% and 89.0% were achieved in the first and second rounds, respectively. Agreement on non-adherence characteristics was high (SCVI and ICC values >0.75). The most appropriate tool for measuring non-adherence to glaucoma medication was the Glaucoma Treatment Compliance Assessment Tool-Short form (GTCAT-S) with an SCVI of 0.91 and ICC of 0.94 (95% CI: 0.78-0.99; p=0.001). The GTCAT-S was identified as the most suitable tool for measuring non-adherence to glaucoma medication. It demonstrated a high SCVI and excellent inter-rater reliability, indicating strong consensus among experts and robust measurement consistency.
- New
- Research Article
- 10.1016/j.jad.2026.121341
- Jun 1, 2026
- Journal of affective disorders
- Sarah Bloch-Elkouby + 9 more
The clinician rated suicide crisis syndrome checklist (SCS-C): Structure, reliability, and concurrent validity among adult psychiatric inpatients.
- New
- Research Article
- 10.1016/j.acepjo.2026.100407
- Jun 1, 2026
- Journal of the American College of Emergency Physicians open
- Micah Wolfsohn + 9 more
As there is no published review of video review use in cardiac arrest (CA) research, we set out to perform a scoping review to describe the demographics, settings, interventions, and outcomes in the literature. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework for scoping reviews, we queried PubMed, Scopus, EMBASE, and Cochrane Library from inception through April 22, 2024, and then updated the query to cover publications through March 7, 2025, including adult CA studies using video-derived data from prehospital, emergency department, or intensive care settings, excluding pediatric and simulation studies. Independent screening and data extraction were both performed by 2 of the reviewers, with a third reviewer and principal investigator resolving discrepancies, respectively. Extracted data encompassed study aims, setting, patient demographics, video review reliability, and detailed information on outcomes, metrics (eg, chest compression fraction), and interventions (eg, intubation). From 3081 identified publications, 76 were included, with 64.5% being manuscripts. They originated from the USA (48.7%), Asia (31.6%), and Europe (19.7%). Studies were predominantly single center (98.7%), from urban settings (82.9%), with retrospective (47.4%) or prospective (31.6%) observational designs. There was marked heterogeneity in reporting methodologies. The median number of patients enrolled was 71. Interrater reliability was reported in only 12 studies. Common reported patient outcomes included return of spontaneous circulation (39.5%), key metrics such as duration of interruptions (52.6%), and time-to-events (51.3%). Frequently reported interventions included mechanical compression device use (36.8%), defibrillation (34.2%), and intubation (28.9%). Publication volume significantly increased over the last 2 decades. 
Video review enables a precise, multidomain assessment of resuscitation performance of CA that conventional data sources cannot provide. Future work should prioritize consensus definitions and establishing minimum reporting standards.
- New
- Research Article
- 10.1111/ajag.70150
- Jun 1, 2026
- Australasian journal on ageing
- Merve Yilmaz Kars + 7 more
This study aimed to assess the validity and reliability of the Turkish version of the Rapid Sarcopenia Screening (RSS) tool, designed to provide a quick and practical method for identifying sarcopenia in older adults. A cross-sectional observational study was conducted among 150 individuals aged 60 years and older attending a geriatric outpatient clinic in Türkiye. The RSS underwent forward-backward translation and linguistic validation. Construct validity was examined by correlation with the SARC-F. Internal consistency was assessed using Cronbach's alpha, whereas intra-rater and inter-rater reliability were evaluated with intraclass correlation coefficients (ICCs). Discriminant validity was analysed using receiver operating characteristic (ROC) curves. The median age of participants was 75 years (range: 62-103), and 59% were female. According to EWGSOP2 criteria, 27% were diagnosed with sarcopenia. The RSS showed good internal consistency (Cronbach's alpha = 0.768). A strong inverse correlation with SARC-F scores (rho = -0.688, p < 0.001) supported construct validity. ROC analysis demonstrated good discriminatory power (AUC = 0.817). Reliability was also excellent, with intra-rater and inter-rater ICCs of 0.980 and 0.963, respectively. The Turkish RSS is a valid, reliable and practical screening tool for detecting sarcopenia in older adults. Its brevity and reliance on self-reported items support its feasibility for routine clinical practice and suggest its potential as an alternative screening method in Türkiye.
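The abstract above reports internal consistency as Cronbach's alpha. A minimal sketch of the standard formula, alpha = k/(k − 1) × (1 − Σ item variances / variance of total scores), is shown below; the function name and the response table are hypothetical and do not come from the study:

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha from rows of respondents and columns of items."""
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least 2 respondents and 2 items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1.0 - sum(item_vars) / total_var)


# Hypothetical responses: 4 respondents x 3 items.
responses = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4]]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Alpha rises when items covary strongly, because the variance of the total score then exceeds the sum of the individual item variances.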
- New
- Research Article
- 10.1097/nna.0000000000001740
- Jun 1, 2026
- The Journal of nursing administration
- Heather Watson + 5 more
The Constant Observation Resource Assessment (CORA) was developed to guide prioritization of resources. Constant observers, also called patient safety attendants, support safe, less restrictive care for high-risk patients. Staffing shortages highlight the need for structured, equitable constant observation (CO) allocation. In this prospective pilot on 15 adult and 4 pediatric units, researchers evaluated CORA's content validity and inter-rater reliability. Over 6 months in 2021, nurses completed pre-/postimplementation surveys on transparency, fairness, accuracy, and confidence in allocation. A subset of assessments was dual-rated within the same shift to calculate the intraclass correlation coefficient, ICC(2,1). Staff ratings improved across all domains (P≤0.002; Cohen's d ≈ 0.74-1.31). Among 600 assessments, the total score ICC(2,1) was 0.717 [95% CI (0.647, 0.775)]; subscales ranged from 0.607 to 0.720. Cronbach's α was 0.888 (raw). CORA showed acceptable inter-rater reliability and initial validity, supporting its potential to standardize CO allocation. The CORA tool offers a standardized, evidence-based approach to allocating CO resources, addressing long-standing concerns about fairness and transparency in sitter assignments. Its implementation may improve patient safety, support clinical decision-making, and reduce burden on nursing staff in high-acuity inpatient settings. Future studies will assess predictive performance and refine decision thresholds.
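The abstract above names ICC(2,1) specifically. As a minimal sketch of that coefficient in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater), computed from the ANOVA mean squares, consider the following; the function name is hypothetical, and the table is the small six-subject, four-rater example commonly reproduced from Shrout and Fleiss (1979), not data from the CORA study:

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per subject, each holding one score per rater.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative table: 6 subjects rated by 4 raters.
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(table)
print(f"ICC(2,1) = {icc:.2f}")
```

Because ICC(2,1) measures absolute agreement, systematic differences between raters (the column mean square) lower the coefficient even when raters rank subjects similarly.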
- New
- Research Article
- 10.1016/j.jsea.2026.100001
- Jun 1, 2026
- Journal of shoulder and elbow arthroplasty
- Patrick Sun + 5 more
Computed tomography Hounsfield units (HUs) estimate bone mineral density, but their predictive value for total shoulder arthroplasty (TSA) mechanical complications is unknown. This study assessed whether glenoid HU (gHU) can be used to predict mechanical complication risk. In this retrospective cohort study at a single tertiary academic center (January 2011-April 2024), pre-operative computed tomography scans from 250 TSAs in 233 patients were independently reviewed by 3 interdisciplinary reviewers to measure gHU. Mechanical complications, including periprosthetic or intraoperative fractures and aseptic implant loosening, were identified and stratified by gHU. Inter-rater reliability was assessed with the intraclass correlation coefficient. Receiver operating characteristic analysis identified optimal gHU thresholds. Risk-group comparisons used chi-square tests and multivariable logistic regression. Among 250 TSAs in 233 patients, intraclass correlation coefficient for gHU measurement was 0.90 (95% confidence interval 0.87-0.92). Receiver operating characteristic analysis yielded a cutoff of 177 HU (area under the curve 0.59; sensitivity 39%; specificity 81%). Mechanical complications occurred in 29% of cases with gHU <177 vs. 13% with gHU ≥177 (P = .008; adjusted P = .02). Patients with gHU <110 had the highest risk (57% vs. 14%; P < .001; adjusted P = .002). Pre-operative gHU measurements can help to predict increased risk of mechanical complications after TSA. A gHU threshold of 177 HU identifies patients at elevated risk, while gHU <110 marks the highest-risk group.
- New
- Research Article
- 10.1016/j.jcms.2026.104547
- Jun 1, 2026
- Journal of cranio-maxillo-facial surgery : official publication of the European Association for Cranio-Maxillo-Facial Surgery
- Xincen Hou + 5 more
This study aimed to evaluate the intra- and inter-rater reproducibility of periocular tumor measurements using three-dimensional (3D) stereophotogrammetry and to explore its potential clinical feasibility in periocular tumor assessment. Standardized 3D facial images were obtained from 150 patients with a total of 175 periocular tumors using the Vectra M3 imaging system. Surface areas were independently measured by two raters. Intra- and inter-rater reliability were assessed using intraclass correlation coefficients (ICC), mean absolute deviation (MAD), technical error of measurement (TEM), relative error of measurement (REM), and relative TEM (rTEM). Key factors influencing measurement reliability were identified through LASSO regression and random forest analysis. Overall measurement reliability was excellent, with intra-rater and inter-rater ICCs of 0.998 and 0.974, respectively. The mean intra-rater and inter-rater MAD were 0.63 mm² and 0.40 mm², while TEMs were 2.29 mm² and 7.81 mm². Intra-rater and inter-rater REM values were 1.94% and 1.22%, respectively. Tumors >5 mm showed significantly higher reliability than tumors ≤5 mm (p=0.010). Tumors with well-defined boundaries had superior reproducibility compared to those with unclear margins (p=0.026). Localization influenced reliability, with lateral canthus tumors showing the highest consistency (p=0.008). Tumor color also affected reliability, with brownish-black tumors exhibiting the greatest reproducibility (p=0.021). Three-dimensional stereophotogrammetry demonstrated high reproducibility for periocular tumor assessment. These findings suggest its potential clinical applicability as an adjunctive tool to support preoperative planning and longitudinal monitoring.
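The abstract above lists MAD, TEM, and rTEM among its error measures. A minimal sketch of these for two repeated measurement series follows, assuming the commonly used definitions MAD = mean |d|, TEM = sqrt(Σd² / 2n), and rTEM = 100 × TEM / grand mean (the abstract does not give the exact formulas used); the function name and the sample values are hypothetical:

```python
import math


def measurement_error_stats(first, second):
    """MAD, TEM, and relative TEM (%) for two paired measurement series."""
    if not first or len(first) != len(second):
        raise ValueError("need two non-empty series of equal length")
    n = len(first)
    diffs = [a - b for a, b in zip(first, second)]

    mad = sum(abs(d) for d in diffs) / n
    tem = math.sqrt(sum(d * d for d in diffs) / (2 * n))
    grand_mean = (sum(first) + sum(second)) / (2 * n)
    rtem = 100.0 * tem / grand_mean
    return {"mad": mad, "tem": tem, "rtem_percent": rtem}


# Hypothetical surface areas in mm^2 from two measurement sessions.
stats = measurement_error_stats([10.0, 12.0], [12.0, 12.0])
print(stats)
```

TEM squares the differences, so it reacts more strongly than MAD to a few large discrepancies, which is one way the two measures can rank rater pairs differently.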
- New
- Research Article
- 10.1016/j.clnesp.2026.103111
- Jun 1, 2026
- Clinical nutrition ESPEN
- Erin Russell + 8 more
Reliability and accuracy of rectus femoris muscle measurements by dietitians using ultrasound, compared to sonographers.
- New
- Research Article
- 10.1016/j.ymgme.2026.109876
- Jun 1, 2026
- Molecular genetics and metabolism
- Randa Sultan + 9 more
Development and validation of a clinical severity score for long-chain fatty acid oxidation disorders using Real-World-Evidence from Canada.
- New
- Research Article
- 10.1016/j.drugalcdep.2026.113125
- Jun 1, 2026
- Drug and alcohol dependence
- Isabella G Bourtin + 6 more
The wake of addiction: Pharmacological strategies for sleep disturbances in stimulant use disorders, a systematic review.
- New
- Research Article
- 10.1002/lio2.70433
- Jun 1, 2026
- Laryngoscope investigative otolaryngology
- Mohammed Sulaiman Alsayyari + 4 more
Large language models (LLMs), such as ChatGPT, are increasingly utilized by physicians for clinical decision support due to their ease of use and versatility. However, their performance in diagnostic imaging remains largely untested. This study prospectively evaluates ChatGPT's ability to interpret sinus computed tomography (CT) scans for chronic rhinosinusitis (CRS), using radiologist assessment as the reference standard. In this prospective cohort study, 102 coronal sinus CT scans were evaluated by both a board-certified radiologist and ChatGPT-4o. Each scan was screen recorded and uploaded twice to ChatGPT to assess repeatability, resulting in 306 total interpretations. The radiologist reviewed the same screen recordings provided to ChatGPT. Both raters assessed 11 predefined binary anatomical features and generated Lund-Mackay scores. Diagnostic performance was assessed using standard accuracy metrics, and inter-rater agreement was evaluated using established reliability coefficients. ChatGPT demonstrated variable performance across anatomical features. Sensitivity ranged from 0.00 to 0.89, and specificity from 0.26 to 0.95. The model demonstrated relatively high sensitivity for mucosal thickening (0.84) and sinus expansion (0.73), as well as strong agreement with the radiologist for the lamina papyracea (AC1 = 0.92) and anterior ethmoid artery (AC1 = 0.77). However, performance was poor for air-fluid levels and bone thinning. Agreement with the radiologist was low across most features (AC1 < 0.4 in 82% of variables), and repeatability between ChatGPT versions was limited (mean AC1 = 0.29). Correlation between runs for Lund-Mackay scores was weak (r = 0.11), and agreement with the radiologist was poor (ICC < 0.07). ChatGPT demonstrates partial capability in identifying specific sinus CT findings; however, it lacks overall diagnostic consistency. Human radiologists remain essential, and the clinical use of LLMs in imaging should be approached with caution.
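The abstract above reports agreement on binary features as AC1 values. A minimal sketch of Gwet's AC1 for two raters and a binary feature is given below, using chance agreement 2π(1 − π) with π the mean proportion of positive ratings across both raters; the function name and the ratings are hypothetical and are not data from the study:

```python
def gwet_ac1_binary(rater_a, rater_b):
    """Gwet's AC1 for two raters scoring the same items as 0 or 1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty rating lists of equal length")
    if any(x not in (0, 1) for x in list(rater_a) + list(rater_b)):
        raise ValueError("ratings must be coded 0 or 1")

    # Observed proportion of items on which the raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Mean proportion of positive ratings across both raters.
    prevalence = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement under Gwet's model; at most 0.5 for two categories.
    chance = 2.0 * prevalence * (1.0 - prevalence)
    return (observed - chance) / (1.0 - chance)


# Hypothetical ratings of one feature on four scans.
ac1 = gwet_ac1_binary([1, 1, 0, 0], [1, 1, 0, 1])
print(f"AC1 = {ac1:.3f}")
```

Unlike Cohen's kappa, the chance term here shrinks as prevalence moves toward 0 or 1, so AC1 tends to stay stable for features that are almost always present or absent.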
- New
- Research Article
- 10.1016/j.ijmedinf.2026.106363
- Jun 1, 2026
- International journal of medical informatics
- Acieh Es'Haghi + 3 more
Accuracy and completeness of large language models in Epidemic keratoconjunctivitis Queries: A Comparative study.