Reliable Assessment of Healthcare Procedures: A Comparison of On-Site and Video-Based Methods.
This study compares on-site, independent video, and collaborative video assessments of venipuncture performance, finding significant score differences and higher reliability with collaborative review, though increased workload may limit practical implementation; video-based methods enhance scoring consistency.
Assessing performance is critical in simulation training programs. Traditionally, predefined evaluation forms completed by experts are used, but this approach may introduce bias. Common strategies to mitigate bias include averaging scores from multiple evaluators or using video recordings to minimize disagreements among assessors. This study compares performance scores obtained through real-time on-site observation, independent video assessment, and collaborative video assessment of venipuncture performance. Additionally, we evaluate whether combining these methods enhances scoring consistency. Eighteen medical students were invited to perform venipuncture trials, which were evaluated in three stages. Two evaluators observed and scored each trial on-site using a predefined evaluation form. The trials were video-recorded. At 12 weeks post-training, the evaluators independently reviewed the videos and assigned performance scores. At 14 weeks, they collaboratively reviewed the videos and provided a joint performance score. The Intraclass Correlation Coefficient (ICC) was used to assess the consistency between evaluators. Mean scores differed significantly among the three assessment methods (P = 0.037), with independent video assessment yielding higher scores (75.3 ± 5.2) than on-site assessment (66.7 ± 6.1). Inter-rater reliability (ICC) ranged from 0.706 to 0.883, but was not statistically compared across methods. While collaborative assessment showed the highest consistency (0.889), implementing a multimodal approach would substantially increase faculty workload, as it requires scoring each student multiple times. Video-based assessments, particularly collaborative review, enable detailed and repeated analysis of procedural skills, improving scoring consistency. However, the feasibility and workload implications of combining multiple methods must be considered before implementation in training programs.
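The abstract above does not say which ICC model was used, so the following is only a minimal sketch of how a two-way ICC could be computed from a subjects-by-raters score table. The function name `icc_two_way` and the example scores are hypothetical; the formulas are the standard single-rater absolute-agreement, ICC(2,1), and consistency, ICC(3,1), forms based on two-way ANOVA mean squares.

```python
import numpy as np

def icc_two_way(scores):
    """Return (ICC(2,1), ICC(3,1)) for an (n_subjects, k_raters) array.

    Assumes a complete table (every rater scored every subject).
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-subject mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    icc_agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_consistency = (msr - mse) / (msr + (k - 1) * mse)
    return icc_agreement, icc_consistency

# Hypothetical scores: 6 trials (rows) rated by 2 evaluators (columns).
example = [[62, 70], [68, 74], [71, 80], [59, 66], [75, 79], [66, 72]]
agreement, consistency = icc_two_way(example)
print(f"ICC(2,1) = {agreement:.3f}, ICC(3,1) = {consistency:.3f}")
```

The agreement form penalizes a systematic offset between evaluators, whereas the consistency form does not, so the choice between them matters when one rater tends to score higher than the other.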
- Research Article
- 10.3390/jcm15093200
- Apr 22, 2026
- Journal of Clinical Medicine
Background: Digital health has accelerated telehealth uptake, yet evidence comparing video-based musculoskeletal assessment with traditional in-person examination is limited. This study evaluated the concurrent validity and interrater reliability of video-based physiotherapy assessment versus face-to-face assessment in patients with knee pain. Methods: Patients with knee pain underwent randomized consecutive in-person and video-based assessments by experienced musculoskeletal physiotherapists. Clinical diagnoses were categorized into seven groups (red flag, yellow flag, arthrogenic, tendinopathy, patellofemoral pain, muscle sprain, neurogenic). Primary outcomes were intermethod agreement and Cohen’s kappa; sensitivity, specificity, PPV, NPV, and interrater reliability for video assessments were also reported. Results: Forty-five participants (mean age 38 ± 6.5 years; 55.6% female) completed the study. In-person and video-based assessments produced identical diagnoses in 43/45 cases (Cohen’s κ = 0.92, p < 0.001). Telehealth accuracy was high across all diagnostic categories (90–100%). Interrater agreement between video-based assessors was 93.3% (κ = 0.89, p < 0.001). Agreement between assessments was moderately associated with KOOS (r = 0.312, p = 0.037). Conclusions: In this selected pragmatic sample, video-based physiotherapy assessment demonstrated high concurrent agreement and excellent interrater reliability with face-to-face assessment. Given the study’s sample size, repeated-measures design, and lack of an independent reference standard, these results indicate feasibility and intermethod agreement rather than diagnostic equivalence. Video assessment may be a feasible option for triage and management in selected settings, but further research in larger, more diverse populations and evaluation against independent reference standards is required.
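As an illustration of the agreement statistic reported above, here is a minimal sketch of unweighted Cohen's kappa for two sets of categorical diagnoses. The function name `cohen_kappa` and the example labels are hypothetical; the abstract does not report the per-category counts, so the published κ cannot be reproduced from it.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: proportion of cases given the same label by both methods.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal proportions, summed over categories.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical diagnoses from in-person and video assessment of five patients.
in_person = ["arthrogenic", "tendinopathy", "patellofemoral pain", "arthrogenic", "neurogenic"]
video = ["arthrogenic", "tendinopathy", "arthrogenic", "arthrogenic", "neurogenic"]
print(round(cohen_kappa(in_person, video), 3))
```

Because kappa corrects for chance agreement, two studies with the same raw percentage agreement can report quite different κ values when their category distributions differ.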
- Research Article
- 10.1016/j.jbiomech.2026.113272
- May 1, 2026
- Journal of biomechanics
Reliability of medial versus lateral video analysis for footstrike assessment in treadmill and overground running.
- Research Article
3
- 10.1002/mdc3.14222
- Oct 8, 2024
- Movement disorders clinical practice
The Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS) is widely employed for cervical dystonia (CD) evaluation. This study aimed to assess the inter-rater reliability of the severity subscale of the original and revised TWSTRS using video recordings. Three raters (a PhD student with a nursing degree, a physiotherapist specialized in CD, and a neurologist-in-training) independently rated all videos. The inter-rater reliability was assessed with the intra-class correlation coefficient (ICC). The total severity score of both tools demonstrated good inter-rater reliability (ICC = 0.87 to 0.88). The inter-rater reliability of individual sub-items varied from poor (ICC = 0.29) to excellent (ICC = 0.9). The total severity score of both TWSTRS versions showed good inter-rater reliability in a multidisciplinary team, indicating their applicability for online patient assessment. We recommend using the total subscale for outcome comparison. Furthermore, there is a need for more accurate definitions of the duration factor and shoulder elevation.
- Research Article
2
- 10.1097/sih.0000000000000672
- Oct 18, 2022
- Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare
Reliability is pivotal in surgical skills assessment. Video-based assessment can be used for objective assessment without the physical presence of assessors. However, its reliability for surgical assessments remains largely unexplored. In this study, we evaluated the reliability of video-based versus physical assessments of novices' surgical performances on human cadavers and 3D-printed models, an emerging simulation modality. Eighteen otorhinolaryngology residents performed 2 to 3 mastoidectomies on a 3D-printed model and 1 procedure on a human cadaver. Performances were rated by 3 experts evaluating the final surgical result using a well-known assessment tool. Performances were rated both hands-on/physically and by video recordings. Interrater reliability and intrarater reliability were explored using κ statistics, and the optimal number of raters and performances required in either assessment modality was determined using generalizability theory. Interrater reliability was moderate, with a mean κ score of 0.58 (range, 0.53-0.62) for video-based assessment and 0.60 (range, 0.55-0.69) for physical assessment. Video-based and physical assessments were equally reliable (G coefficient 0.85 vs. 0.80 for 3D-printed models and 0.86 vs. 0.87 for cadaver dissections). The interaction between rater and assessment modality contributed 8.1% to 9.1% of the estimated variance. For the 3D-printed models, 2 raters evaluating 2 video-recorded performances or 3 raters physically assessing 2 performances yielded sufficient reliability for high-stakes assessment (G coefficient >0.8). Video-based and physical assessments were equally reliable. Some raters were affected by changing from physical to video-based assessment; consequently, assessment should be either physical or video based, not a combination.
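The "optimal number of raters and performances" step above is what generalizability theory calls a decision study. The sketch below shows how a relative G coefficient could be projected for different numbers of raters and performances, assuming a fully crossed person × rater × performance design. That design, the function name `g_coefficient`, and all variance components are hypothetical, since the abstract does not report them.

```python
def g_coefficient(var_p, var_pr, var_pt, var_prt_e, n_raters, n_tasks):
    """Relative G coefficient for a crossed person x rater x task design.

    var_p      : variance between persons (the signal of interest)
    var_pr     : person-by-rater interaction variance
    var_pt     : person-by-task interaction variance
    var_prt_e  : residual variance (three-way interaction plus error)
    """
    if n_raters < 1 or n_tasks < 1:
        raise ValueError("need at least one rater and one task")
    # Relative error variance shrinks as scores are averaged over more raters and tasks.
    rel_error = (var_pr / n_raters
                 + var_pt / n_tasks
                 + var_prt_e / (n_raters * n_tasks))
    return var_p / (var_p + rel_error)

# Hypothetical variance components; scan a small grid of rater/performance counts.
for raters in (1, 2, 3):
    for tasks in (1, 2, 3):
        g = g_coefficient(4.0, 1.0, 1.0, 2.0, raters, tasks)
        print(f"raters={raters} performances={tasks} G={g:.2f}")
```

Scanning such a grid for the smallest combination that reaches a chosen threshold (the abstract uses G > 0.8 for high-stakes assessment) is one way to arrive at a recommendation of the form "2 raters evaluating 2 performances".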
- Research Article
16
- 10.3390/ijerph18041652
- Feb 1, 2021
- International Journal of Environmental Research and Public Health
The Test of Gross Motor Development (TGMD) is one of the most common tools for assessing fundamental movement skills (FMS) in children between 3 and 10 years of age. This study aimed to examine the intra-rater and inter-rater reliability of the third edition of the TGMD (TGMD-3) between expert and novice raters using live and video assessment. Five raters [2 experts and 3 novices (one of them with a BSc in Physical Education and Sport Science)] assessed and scored the TGMD-3 performance of 25 healthy children [female: 60%; mean (standard deviation) age 9.16 (1.31) years]. The schoolchildren attended one public elementary school in Santiago de Compostela (Spain) during the academic year 2019–2020. Raters scored each child's performance through two viewing modes (live and slow-motion). The ICC (Intraclass Correlation Coefficient) was used to determine the agreement between raters. Our results showed moderate-to-excellent intra-rater reliability for the overall score and the locomotor and ball skills subscales; moderate-to-good inter-rater reliability for the overall score and ball skills; and poor-to-good inter-rater reliability for the locomotor subscale. Higher intra-rater reliability was achieved by the expert raters and the novice rater with a physical education background than by the other novice raters. However, the inter-rater reliability was more variable across all raters regardless of their experience or background. No significant differences in reliability were found when comparing live and video assessments. For clinical practice, it is recommended that raters reach an agreement before the assessment to avoid subjective interpretations that might distort the results.
- Research Article
3
- 10.4055/cios24190
- Jan 1, 2024
- Clinics in orthopedic surgery
Strain elastography (SE) and shear wave elastography (SWE) are emerging techniques for evaluating the elasticity of soft tissue. This study aimed to determine interobserver and intraobserver reliability for elasticity measurements of different tissues and anatomic locations using SE and SWE. Ten healthy adult male individuals with 20 upper extremities participated in this study. The elasticities of the wrist extensor muscle, the common extensor tendon, and supraspinatus tendon were measured. Strain ratio and shear wave velocity were measured twice by 2 different examiners (examiner 1 with over 20 years of experience in musculoskeletal sonography and examiner 2 with 1 year of experience). Interobserver and intraobserver reliability was assessed using the intraclass correlation coefficient (ICC). The 10 individuals' age ranged from 28 to 35 years. In SE, interobserver reliabilities at the 3 anatomic locations (wrist extensor muscle, common extensor tendon, and supraspinatus tendon) showed fair to moderate agreement (ICC = 0.489, p = 0.076; ICC = 0.408, p = 0.131; and ICC = 0.296, p = 0.711, respectively). The intraobserver reliabilities of examiner 1 were moderate to substantial only at the wrist extensor muscle and the common extensor tendon (ICC = 0.563, p = 0.039 and ICC = 0.702, p = 0.006, respectively). In SWE, interobserver reliabilities for the wrist extensor muscle and the supraspinatus tendon were moderate to substantial (ICC = 0.756, p = 0.002 and ICC = 0.565, p = 0.039, respectively). The intraobserver reliabilities of examiner 1 at the 3 anatomic locations were almost perfect (ICC = 0.843, p = 0.001; ICC = 0.800, p = 0.001; and ICC = 0.825, p = 0.001, respectively). The results of examiner 2 showed almost perfect agreement at the wrist extensor muscle (ICC = 0.886, p = 0.001) and moderate to substantial agreement at the tendons of the common extensor and supraspinatus (ICC = 0.592, p = 0.029 and ICC = 0.682, p = 0.008, respectively). 
SWE is a reliable method for assessing the flexibility of soft tissue, but it is affected by expertise and the specific anatomical site.
- Research Article
13
- 10.2196/38101
- Aug 22, 2022
- JMIR Rehabilitation and Assistive Technologies
Background: Rehabilitation provided via telehealth offers an alternative to currently limited in-person health care. Effective rehabilitation depends on accurate and relevant assessments that reliably measure changes in function over time. The reliability of a suite of relevant assessments to measure the impact of rehabilitation on physical function is unknown. Objective: We aimed to measure the intrarater reliability of mobility-focused physical outcome measures delivered via Zoom (a commonly used telecommunication platform) and interrater reliability, comparing Zoom with in-person measures. Methods: In this reliability trial, healthy volunteers were recruited to complete 7 mobility-focused outcome measures in view of a laptop, under instructions from a remotely based researcher who undertook the remote evaluations. An in-person researcher (providing the benchmark scores) concurrently recorded their scores. Interrater and intrarater reliability were assessed for Grip Strength, Functional Reach Test, 5-Time Sit to Stand, 3- and 4-Meter Walks and Timed Up and Go, using intraclass correlation coefficients (ICC) and Bland-Altman plots. These tests were chosen because they cover a wide array of physical mobility, strength, and balance constructs; require little to no assistance from a clinician; can be performed in the limits of a home environment; and are likely to be feasible over a telehealth delivery mode. Results: A total of 30 participants (mean age 36.2, SD 12.5 years; n=19, 63% male) completed all assessments. Interrater reliability was excellent for Grip Strength (ICC=0.99) and Functional Reach Test (ICC=0.99), good for 5-Time Sit to Stand (ICC=0.842) and 4-Meter Walk (ICC=0.76), moderate for Timed Up and Go (ICC=0.64), and poor for 3-Meter Walk (ICC=–0.46).
Intrarater reliability, assessed by the remote researcher, was excellent for Grip Strength (ICC=0.91); good for Timed Up and Go, 3-Meter Walk, 4-Meter Walk, and Functional Reach (ICC=0.84-0.89); and moderate for 5-Time Sit to Stand (ICC=0.67). Although recorded simultaneously, the following time-based assessments were recorded as significantly longer via Zoom: 5-Time Sit to Stand (1.2 seconds), Timed Up and Go (1.0 seconds), and 3-Meter Walk (1.3 seconds). Conclusions: Untimed mobility-focused physical outcome measures have excellent interrater reliability between in-person and telehealth measurements. Timed outcome measures took approximately 1 second longer via Zoom, reducing the reliability of tests with a shorter duration. Small time differences favoring in-person attendance are of a similar magnitude to clinically important differences, indicating assessments undertaken using telecommunications technology (Zoom) cannot be compared directly with face-to-face delivery. This has implications for clinicians using blended (ie, some face-to-face and some via the internet) assessments. High intrarater reliability of mobility-focused physical outcome measures has been demonstrated in this study.
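The systematic offset described above (timed tests reading about 1 second longer via Zoom) is the kind of bias a Bland-Altman analysis is designed to expose. The following is a minimal sketch of the bias and 95% limits of agreement for paired measurements. The function name `bland_altman` and the timing data are hypothetical, and the 1.96 multiplier assumes approximately normally distributed differences.

```python
import numpy as np

def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of (method_a - method_b); the limits of agreement are
    bias +/- 1.96 sample standard deviations of the differences.
    """
    a = np.asarray(method_a, dtype=float)
    b = np.asarray(method_b, dtype=float)
    if a.shape != b.shape or a.size < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diff = a - b
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical Timed Up and Go durations (seconds) for the same five trials.
zoom = [8.9, 10.2, 7.8, 9.5, 11.1]
in_person = [8.0, 9.1, 7.0, 8.4, 10.0]
bias, lower, upper = bland_altman(zoom, in_person)
print(f"bias={bias:.2f}s, limits of agreement=({lower:.2f}s, {upper:.2f}s)")
```

A nonzero bias alongside narrow limits would suggest a consistent offset rather than random disagreement, which is why an ICC alone can hide this kind of problem.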
- Research Article
10
- 10.3390/diagnostics13162661
- Aug 12, 2023
- Diagnostics
Neonatal pain assessment (NPA) represents a huge global problem of essential importance, as a timely and accurate assessment of neonatal pain is indispensable for implementing pain management. To investigate the consistency of pain scores derived through video-based NPA (VB-NPA) and on-site NPA (OS-NPA), providing the scientific foundation and feasibility of adopting VB-NPA results in a real-world scenario as the gold standard for neonatal pain in clinical studies and labels for artificial intelligence (AI)-based NPA (AI-NPA) applications. A total of 598 neonates were recruited from a pediatric hospital in China. This observational study recorded 598 neonates who underwent one of 10 painful procedures, including arterial blood sampling, heel blood sampling, fingertip blood sampling, intravenous injection, subcutaneous injection, peripheral intravenous cannulation, nasopharyngeal suctioning, retention enema, adhesive removal, and wound dressing. Two experienced nurses performed OS-NPA and VB-NPA at a 10-day interval through double-blind scoring using the Neonatal Infant Pain Scale to evaluate the pain level of the neonates. Intra-rater and inter-rater reliability were calculated and analyzed, and a paired samples t-test was used to explore the bias and consistency of the assessors' pain scores derived through OS-NPA and VB-NPA. The impact of different label sources was evaluated using three state-of-the-art AI methods trained with labels given by OS-NPA and VB-NPA, respectively. The intra-rater reliability of the same assessor was 0.976-0.983 across different times, as measured by the intraclass correlation coefficient. The inter-rater reliability was 0.983 for single measures and 0.992 for average measures. No significant differences were observed between the OS-NPA scores and the assessment of an independent VB-NPA assessor. The different label sources only caused a limited accuracy loss of 0.022-0.044 for the three AI methods. 
VB-NPA in a real-world scenario is an effective way to assess neonatal pain due to its high intra-rater and inter-rater reliability compared to OS-NPA and could be used for the labeling of large-scale NPA video databases for clinical studies and AI training.
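The "single measures" and "average measures" ICCs reported above are linked by the Spearman-Brown relationship, provided both come from the same ICC model. The sketch below applies that relationship. The function name is hypothetical, and the use of k = 2 is an assumption based on the two nurses described in the abstract, which does not state the ICC model.

```python
def average_measures_icc(single_icc, k):
    """Project a single-rater ICC to the ICC of the mean of k raters (Spearman-Brown)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_icc / (1 + (k - 1) * single_icc)

# Reported single-measures ICC of 0.983, averaged over an assumed 2 raters.
print(round(average_measures_icc(0.983, 2), 3))
```

With the published single-measures value this gives roughly 0.991, close to the reported average-measures value of 0.992; the small gap may simply reflect rounding of the published figures.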
- Supplementary Content
- 10.25419/rcsi.10818062.v1
- Nov 23, 2019
- Figshare
Introduction: Torticollis is a clinical sign of asymmetric neck posture. In infancy, the most common causes are muscular in nature and can be classified as Congenital Muscular Torticollis or Postural Torticollis. Assessment of neck function is essential for diagnosis and management of torticollis. A systematic review demonstrated a paucity of reliable and valid measurement tools, in particular for the assessment of postural side-flexion (head tilt) and active neck rotation, in the upright position. Furthermore, most physiotherapists commonly use visual estimation in clinical practice, which has not been adequately tested for reliability in this population. Aims and objectives: This study aimed to examine the reliability of visual estimation for the assessment of head tilt and active neck rotation in the upright position, on infants with torticollis by physiotherapists. A further aim was to examine the impact of the physiotherapists’ clinical experience on their reliability. Methods: This was an observational (reliability) study, which involved the recruitment of 31 infants and 26 physiotherapists. Videos were taken of the infants’ head position in the frontal plane (anterior view) and active neck rotation (lateral view). Using a secure online portal, they were observed and rated by the physiotherapists on two occasions, at least one week apart. Inter-rater and intra-rater reliability was calculated using the intra-class correlation coefficient (ICC) and Standard Error of Measurement (SEM). The relationship between physiotherapists’ clinical experience (using three different criteria) and intrarater reliability was analysed using a Pearson product-moment correlation coefficient. Results: Overall, inter-rater reliability was good (mean ICC: 0.68 ± 0.20, 0.13 - 0.98; mean SEM: 5.1° ± 2.1°, 1-12°). Rotation videos had better reliability (mean ICC: 0.79 ± 0.14), in comparison to head tilt videos (mean ICC: 0.58 ± 0.20). 
Intra-rater reliability was excellent (mean ICC: 0.85 ± 0.09, 0.55 to 0.94) for both head tilt (mean ICC: 0.84 ± 0.08) and rotation (mean ICC: 0.85 ± 0.09). There was no correlation between intra-rater reliability and clinical experience. Conclusions and implications: Visual estimation has excellent intra-rater reliability and good inter-rater reliability in the assessment of head tilt and active neck rotation in the upright position for infants with torticollis. In both cases, assessment of rotation was more reliable than that of head tilt. Using an ICC value of ≥0.7 for a test to be clinically acceptable, inter-rater reliability of head tilt was found to be unacceptable. There was a wide variation in reliability and no correlation was found between reliability and clinical experience. Therefore, it is recommended that physiotherapists test their own reliability if possible, and that an alternative tool for the assessment of head tilt be explored.
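The abstract reports the Standard Error of Measurement (SEM) alongside the ICC but does not say how it was calculated. One commonly used formula derives it from the score standard deviation and the ICC, sketched here. The function name and the example inputs are hypothetical and are not taken from the study.

```python
import math

def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), in the same units as the measurement."""
    if sd < 0 or not (0 <= icc <= 1):
        raise ValueError("sd must be non-negative and icc must lie in [0, 1]")
    return sd * math.sqrt(1 - icc)

# Hypothetical: ratings with a standard deviation of 10 degrees and an ICC of 0.75.
print(standard_error_of_measurement(10.0, 0.75))
```

Because the SEM is expressed in degrees rather than as a ratio, it indicates how far an individual visual estimate might plausibly sit from the true angle, which the ICC alone does not convey.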
- Research Article
32
- 10.1016/j.jhsa.2011.10.056
- Jan 28, 2012
- The Journal of Hand Surgery
Reliability and Clinical Importance of Teardrop Angle Measurement in Intra-articular Distal Radius Fracture
- Research Article
2
- 10.1590/1413-785220212902236763
- Jan 1, 2021
- Acta ortopedica brasileira
Objective: To evaluate the reproducibility of an S2-alar iliac (S2AI) screw parameters measurement method by inter and intraobserver reliability. Methods: Cross-sectional study, considering computed tomography exams. Morphometric analysis was performed by multiplanar reconstructions. Screw length, diameter and trajectory angles were the studied variables. To analyze the measurements reproducibility, intraclass correlation coefficient (ICC) was used. Results: Interobserver reliability was classified as strong for screw shortest length (ICC: 0.742) and diameter (ICC: 0.699). Interobserver reliability was classified as moderate for screw longest length (ICC: 0.553) and for screw trajectory angles in the axial plane for the longest (ICC: 0.478) and for the shortest lengths (ICC: 0.591). Intraobserver reliability was interpreted as excellent for screw shortest (ICC: 0.932) and longest lengths (ICC: 0.962) and diameter (ICC: 0.770) and screw trajectory angles in the axial plane for the screw longest (ICC: 0.773) and shortest lengths (ICC: 0.862). There were weak interobserver and strong intraobserver reliabilities for trajectory angle in sagittal plane, but no statistical significance was found. Conclusion: Inter and intraobserver reliability of S2AI screw morphometric parameters were interpreted from moderate to excellent in almost all studied variables, except for the screw trajectory angle in the sagittal plane measurement. Level of Evidence IV, Diagnostic Studies - Investigating a Diagnostic Test.
- Research Article
2
- 10.1016/j.surg.2024.11.002
- Feb 1, 2025
- Surgery
Background: Recently, a competency assessment tool has been developed within the RIGHT project, a national quality improvement program for minimally invasive right hemicolectomy in patients with colon cancer. This study aimed to evaluate whether trained medical students can reliably evaluate minimally invasive right hemicolectomy videos using a competency assessment tool. Methods: Nine expert colorectal surgeons, 13 trained medical students, and 17 untrained medical students assessed the surgical quality of 6 full-length minimally invasive right hemicolectomy videos with the competency assessment tool. The expert surgeons were trained using the competency assessment tool by the RIGHT project leaders, who were also involved in the development and validation of the competency assessment tool. Training for medical students included anatomy, step-by-step procedure explanation, and competency assessment tool review with 2 supervised video assessments. The untrained students were taught only anatomy and minimally invasive right hemicolectomy steps. The intraclass correlation coefficient was calculated to determine inter-rater reliability, and analysis of variance with the Bonferroni correction for multiple testing was used to assess potential differences between the groups per video. Results: The trained students demonstrated an overall excellent inter-rater reliability (intraclass correlation coefficient score of 0.885). When their scores were combined with those of the expert surgeons, a high inter-rater reliability was also demonstrated (intraclass correlation coefficient score of 0.945). Trained students consistently aligned with surgeons’ mean total scores, also accurately identifying lower quality surgeries. Untrained students assigned statistically significantly higher scores to the 3 lower quality surgeries as compared with expert surgeons and trained students.
Conclusion: Among trained students, excellent inter-rater reliability and concordance with expert colorectal surgeons was found. The study highlights the potential to engage trained medical students for objective minimally invasive right hemicolectomy video assessment.
- Research Article
4
- 10.1097/pec.0000000000002836
- Sep 7, 2022
- Pediatric Emergency Care
Capillary refill time (CRT) to assess peripheral perfusion in children with suspected shock may be subject to poor reproducibility. Our objectives were to compare video-based and bedside CRT assessment using a standardized protocol and evaluate interrater and intrarater consistency of video-based CRT (VB-CRT) assessment. We hypothesized that measurement errors associated with raters would be low for both standardized bedside CRT and VB-CRT as well as VB-CRT across raters. Ninety-nine children (aged 1-12 y) had 5 consecutive bedside CRT assessments by an experienced critical care clinician following a standardized protocol. Each CRT assessment was video recorded on a black background. Thirty video clips (10 with bedside CRT < 1 s, 10 with CRT 1-2 s, and 10 with CRT > 2 s) were randomly selected and presented to 10 clinicians twice in randomized order. They were instructed to push a button when they visualized release of compression and completion of a capillary refill. The correlation and absolute difference between bedside and VB-CRT were assessed. Consistency across raters and within each rater was analyzed using the intraclass correlation coefficient (ICC). A Generalizability study was performed to evaluate sources of variation. We found moderate agreement between bedside and VB-CRT observations (r = 0.65; P < 0.001). The VB-CRT values were shorter by 0.17 s (95% confidence interval, 0.09-0.25; P < 0.001) on average compared with bedside CRT. There was moderate agreement in VB-CRT across raters (ICC = 0.61). Consistency of repeated VB-CRT within each rater was moderate (ICC = 0.71). Generalizability study revealed the source of largest variance was from individual patient video clips (57%), followed by interaction of the VB-CRT reviewer and patient video clip (10.7%). Bedside and VB-CRT observations showed moderate consistency. Using video-based assessment, moderate consistency was also observed across raters and within each rater. 
Further investigation to standardize and automate CRT measurement is warranted.
- Research Article
- 10.1093/ehjimp/qyag069
- Jan 1, 2026
- European heart journal. Imaging methods and practice
Recent studies highlight that left ventricular (LV) and right ventricular (RV) strain-volume/area interactions, particularly systolic slope and coupling parameters, carry clinical and physiological relevance. This study evaluated the intra-observer, inter-observer, and test-retest reliability of echocardiographic LV and RV strain-volume/area loops. Twenty-nine healthy adults underwent two transthoracic echocardiograms 2 h apart after standardized preparation. One observer analysed the first scan twice (intra-observer reliability) and the second scan once (test-retest reliability). A second observer analysed the first scan once (inter-observer reliability). Observers were blinded and analysed data independently. Reliability was assessed for systolic (systolic slope [SS], peak strain [PS]) and coupling parameters (early [EarlyU] and late diastolic uncoupling [LateU]), using intra-class correlation coefficients (ICCs) and Bland-Altman analyses. ICCs were generally higher for LV strain-volume than for RV strain-area loops. For LV, intra-/inter-observer and test-retest reliability was good-to-excellent for SS (ICCs: 0.84-0.92), moderate-to-good for PS and EarlyU (ICCs: 0.64-0.85 and 0.60-0.87, respectively), and poor-to-good for LateU (ICCs: 0.48-0.78). For RV, reliability was good for SS (ICCs: 0.78-0.89), poor-to-moderate for PS (ICCs: 0.19-0.59), moderate for EarlyU and LateU (ICCs: 0.53-0.68, and 0.60-0.73, respectively). Systematic bias was minimal. Reliability was superior for LV-based parameters compared to RV. Both the LV and RV loops showed moderate-to-excellent reliability for SS and EarlyU, whilst reliability for PS and LateU varied from poor-to-good. These findings provide a methodological basis for future studies applying strain-volume and strain-area loops.
- Research Article
1
- 10.1186/s12891-025-09201-x
- Sep 23, 2025
- BMC Musculoskeletal Disorders
Background: Digital health technologies are advancing rapidly, with an increasing number of physiotherapists favoring real-time, video-based platforms over telephone-based modalities. Osteoarthritis affects an estimated 595 million individuals worldwide, with approximately 62% of cases involving the knee joint. However, evidence regarding the validity and reliability of video-based assessments for knee osteoarthritis (KOA) remains scarce. The purpose of this pilot study was to explore the feasibility of video-based physiotherapy assessment of KOA in patients with knee pain. Additionally, we aimed to provide preliminary data on its concurrent validity and interrater reliability compared with conventional face-to-face assessment. Methods: A cross-sectional validity and reliability pilot study was conducted in June 2024. Participants were recruited through public advertisements. Eligible individuals were aged 45 years or older and reported knee pain. Each participant underwent both a real-time video-based physiotherapy assessment and a conventional, face-to-face assessment. The video-based assessments were recorded for later analysis. Concurrent validity was examined by determining the exact or potential agreement between the video-based and face-to-face assessments. Interrater reliability was evaluated by comparing the live video-based assessments with those obtained from the recorded video-based assessments. Results: For concurrent validity, exact agreement was observed in 28 of 35 cases (80%; κ = 0.35), indicating fair agreement. Potential agreement was achieved in 33 of 35 cases (94%; κ = 0.64), indicating substantial agreement. Interrater reliability demonstrated exact agreement in 25 of 29 cases (86%; κ = 0.52), corresponding to moderate agreement.
Potential agreement for interrater reliability was observed in 27 of 29 cases (93%; κ = 0.63), corresponding to substantial agreement. Conclusions: Video-based physiotherapy assessment appears feasible and may provide preliminary indications of validity for diagnosing KOA in individuals with nontraumatic knee pain. The results suggest acceptable interrater agreement and highlight the need for more standardized digital assessment protocols to ensure consistent and reliable use in clinical practice. Trial registration: ISRCTN Registry (ISRCTN41057250), 09/05/2025. Retrospectively registered. Prospectively registered in FoU in VGR (researchweb.org) 282608, Date of registration 26/03/2024. Supplementary Information: The online version contains supplementary material available at 10.1186/s12891-025-09201-x.
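A short worked equation may help explain why 80% exact agreement corresponds to only fair agreement (κ = 0.35) in the results above. Assuming the reported statistic is an unweighted Cohen's kappa, and using the rounded published values, the definition can be rearranged to back out the chance agreement $p_e$ implied by the observed agreement $p_o$:

```latex
\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_e = \frac{p_o - \kappa}{1 - \kappa}
    = \frac{0.80 - 0.35}{1 - 0.35}
    \approx 0.69
\]
```

Under these assumptions, roughly 69% agreement would be expected by chance alone, which typically happens when one diagnostic category dominates the sample. The abstract does not report the category counts, so this figure is an approximation inferred from the published numbers rather than a value stated by the authors.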