Introduction. Prior to graduation, paramedic students must be assessed for terminal competency and preparedness for national credentialing examinations. Although the procedures for determining competency vary, many academic programs use a practical and/or oral examination, often scored using skill sheets, for evaluating psychomotor skills. However, even with validated testing instruments, the interevaluator reliability of this process is unknown. Objective. We sought to estimate the interevaluator reliability of a subset of paramedic skills as commonly applied in terminal competency testing. Methods. A mock examinee was videotaped performing staged examinations mimicking adult ventilatory management, oral board, and static and dynamic cardiac stations during which the examinee committed a series of prespecified errors. The videotaped performances were then evaluated by a group of qualified evaluators using standardized skill sheets. Interevaluator variability was measured by standard deviation and range, and reliability was evaluated using Krippendorff's alpha. Correlation between scores and evaluator demographics was assessed by Pearson correlation. Results. Total scores and critical errors varied considerably across all evaluators and stations. The mean (± standard deviation) scores were 24.77 (±2.37) out of a possible 27 points for the adult ventilatory management station, 11.69 (±2.71) out of a possible 15 points for the oral board station, 7.79 (±3.05) out of a possible 12 points for the static cardiology station, and 22.08 (±1.46) out of a possible 24 points for the dynamic cardiology station. Scores ranged from 18 to 27 for adult ventilatory management, 7 to 15 for the oral board, 2 to 12 for static cardiology, and 19 to 24 for dynamic cardiology. Krippendorff's alpha coefficients were 0.30 for adult ventilatory management, 0.01 for the oral board, 0.10 for static cardiology, and 0.48 for dynamic cardiology. 
Critical criteria errors were assigned by 10 (38.5%) evaluators for adult ventilatory management, five (19.2%) for the oral board, and nine (34.6%) for dynamic cardiology. Total scores were not correlated with evaluator demographics. Conclusions. There was high variability and low reliability among qualified evaluators using skill sheets as a scoring tool in the evaluation of a mock terminal competency assessment. Further research is needed to determine the true overall interevaluator reliability of this commonly used approach, as well as the ideal number, training, and characteristics of prospective evaluators.
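For readers unfamiliar with the reliability statistic reported above, the following is a minimal sketch of Krippendorff's alpha, computed as one minus the ratio of observed to expected disagreement. It rests on assumptions not stated in the abstract: an interval (squared-difference) metric, each skill-sheet item treated as one unit scored by several evaluators, and the hypothetical function name `krippendorff_alpha_interval`. The scores in the usage lines are invented for illustration and are not study data.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with an interval (squared-difference) metric.

    `units` is a list of lists: each inner list holds the scores that the
    evaluators assigned to one unit (assumed here to be one skill-sheet item).
    Units with fewer than two scores cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable scores
    if n < 2:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: squared differences between scores within
    # the same unit, each unit weighted by 1 / (m_u - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairs of
    # scores, regardless of which unit they came from.
    values = [v for u in units for v in u]
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1 - d_o / d_e


# Hypothetical data: three items, each scored by three evaluators.
perfect = [[1, 1, 1], [2, 2, 2], [0, 0, 0]]
mixed = [[0, 1], [1, 0]]
alpha_perfect = krippendorff_alpha_interval(perfect)
alpha_mixed = krippendorff_alpha_interval(mixed)
```

An alpha of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement, which puts the reported coefficients of 0.01 to 0.48 in context.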