Rater Drift Research Articles

The purpose of this study was to assess the inter-rater and intra-rater reliability of the 2-minute, 90° push-up test as used in the Army Physical Fitness Test. Analysis of rater assessment reliability included both total score agreement and agreement across individual push-up repetitions. The study used 8 Raters who assessed 15 different videotaped push-up performances over 4 iterations separated by a minimum of 1 week. The 15 push-up participants were videotaped during the semiannual Army Physical Fitness Test. Each Rater viewed the 15 push-up performances in random order and verbally responded with a "yes" or "no" to each push-up repetition. The data were analyzed using the Pearson product-moment correlation as well as the kappa, the modified kappa, and the intra-class correlation coefficient (3,1). An attribute agreement analysis was conducted to determine the percentage of inter-rater and intra-rater agreement across individual push-ups. The results indicated that Raters varied a great deal in assessing push-ups. Over the 4 trials of 15 participants, the overall scores of the Raters varied between 3.0 and 35.7 push-ups. Post hoc comparisons found a significant increase in the grand mean of push-ups from trials 1-3 to trial 4 (p < 0.05). There was also a significant difference among Raters over the 4 trials (p < 0.05). Pearson inter-rater reliability coefficients were between 0.10 and 0.97, and intra-rater coefficients were between 0.48 and 0.99. Intra-rater agreement for individual push-up repetitions ranged from 41.8% to 84.8%. The results indicated that the Raters failed to assess the same push-up repetition with the same score (below 70% agreement) and also failed to agree with one another (29%).
Interestingly, as previously mentioned, scores on trial 4 increased significantly, which might have been caused by rater drift or by the Raters not maintaining the push-up standard over the trials. It does appear that the final push-up scores received by each participant were a close approximation of actual performance (within 65%), but when assessing physical performance for retention in the Army, a more reliable test might be considered.
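For readers unfamiliar with the statistics named in this abstract, the following is a minimal sketch of percent agreement, Cohen's kappa, and the intra-class correlation coefficient (3,1), using textbook formulas. It is not the authors' analysis code. The function names and all ratings and scores below are hypothetical and chosen only for illustration; the modified kappa mentioned in the abstract is not covered.

```python
from typing import List, Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of repetitions on which two rating sequences give the same call."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    # Chance agreement: product of each rater's marginal rate, summed per label.
    p_chance = sum((list(a).count(lab) / n) * (list(b).count(lab) / n)
                   for lab in labels)
    if p_chance == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


def icc_3_1(scores: List[List[float]]) -> float:
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    scores[i][j] is the total score given to participant i by rater j.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical "yes"/"no" calls by one rater on the same 4 repetitions,
# viewed on two separate occasions (an intra-rater comparison).
trial_1 = ["yes", "yes", "no", "no"]
trial_4 = ["yes", "no", "no", "no"]
print(percent_agreement(trial_1, trial_4))
print(cohen_kappa(trial_1, trial_4))

# Hypothetical total scores: 3 participants (rows) by 2 raters (columns).
# The second rater is always 2 push-ups more lenient than the first.
totals = [[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]
print(icc_3_1(totals))
```

Because ICC(3,1) measures consistency and not absolute agreement, a rater who is uniformly more lenient (as in the hypothetical `totals` data) does not lower it. That is one reason the abstract's repetition-level agreement figures can look much worse than correlation-type coefficients.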

Inappropriate subjects may be enrolled in a study when enrollment pressures cause inflated baseline severity scores. An increasing number of studies now include methods such as blinded independent centralized ratings (CR) to ensure that appropriate subjects are entered into the trial. Post-baseline factors such as functional unblinding, expectation bias and rater drift can also affect outcomes. Independent raters, blind to study visit, can minimize functional unblinding and expectation bias. Continuous calibration of CR can minimize rater drift. The objective was to examine studies that used both site ratings (SR) and CR to determine how critical post-baseline blinding and continuous calibration are. A trial of acute schizophrenia used CR for the PANSS and SR for the BPRS on the same subjects. A Parkinson's psychosis study used CR in the US and SR outside the US to assess subjects using the SAPS. A GAD trial used CR of subjects enrolled by SRs’ SIGH-A evaluations. In the schizophrenia trial, CR separated the active comparator and one of two test arms from placebo. SR separated the active comparator but neither test arm. In the Parkinson's psychosis study, pimavanserin showed greater separation with CR than with SR. In the GAD trial, CR had a lower placebo response than SR, independent of subject selection. Data from several studies support the continued importance of rater blinding and independence after subject selection. Results suggest that precision of ratings beyond baseline can increase the sensitivity of findings in a clinical trial, decrease placebo response rates and potentially eliminate Type II errors.

Articles published on Rater Drift

New Tests of Rater Drift in Trend Scoring

The Role of Time on Performance Assessment (Self, Peer and Teacher) in Higher Education: Rater Drift

Examining The Rater Drift in The Assessment of Presentation Skills in Secondary School Context

Automated essay scoring (AES) of constructed responses in nursing examinations: An evaluation

A Mixed Method Program Evaluation of Annual Inspections Conducted in Childcare Programs in Washington State

Examining the Calibration Process for Raters of the GRE® General Test

On the Performance of the Marginal Homogeneity Test to Detect Rater Drift.

Inter-Rater Reliability and Intra-Rater Reliability of Assessing the 2-Minute Push-Up Test.

Trends in Classroom Observation Scores.

The Effect of Observation Length and Presentation Order on the Reliability and Validity of an Observational Measure of Teaching Quality

Using Automated Essay Scores as an Anchor When Equating Constructed Response Writing Tests

Training and Maintaining System-Wide Reliability in Outcome Management

P-641 - The importance of rigor in post-baseline assessments in CNS clinical trials

Rater Effects on Essay Scoring: A Multilevel Analysis of Severity Drift, Central Tendency, and Rater Experience

P.2.a.007 Quantifying rater drift on the HAM-D: implications for reliability, sample size, and ongoing training strategy

Interrater Reliability of Using Brief Standardized Outcome Measures in a Community Mental Health Setting

P02-88 - Quantifying rater drift on the HAM-D in a sample of standardized rater training events: Implications for reliability and sample size calculations

Inaccuracy in Clinical Trials: Effects and Methods to Control Inaccuracy

P.3.f.002 Quantifying rater drift in a sample of standardised rater training events: Is PANSS reliability maintained over time?
