Despite advances in eye-tracking technology for evaluating smooth pursuit (SP) eye movements, qualitative observation offers information that is not captured by computerized measures; hence, both objective and qualitative information should be used to evaluate SP. This study examined the consistency among our clinicians when evaluating SP using normal (N), grossly normal (GN), mildly abnormal (MA), and abnormal (AB) as classifications. We then evaluated the effect of combining GN and MA into a single subclinical (SUBC) category. We also evaluated the computerized percent saccade (PS) metric by determining its sensitivity and specificity in classifying SP. Retrospective horizontal and vertical SP test videos and numerical data for 70 participants were obtained from the Neuro Kinetics Neuro-Otologic Test Center and de-identified. From these data, eye-tracking videos, time plots of eye-tracking positional data, and tables of SP eye-tracking performance data were generated for 0.1, 0.3, and 0.5 Hz in both horizontal and vertical planes, totaling 6 tests per subject. Three clinicians rated each subject's SP performance as N, GN, MA, or AB, for a total of 6 ratings per subject (3 frequencies in each of the horizontal and vertical planes). This process was repeated using N, SUBC, and AB as rating categories. Clinicians also provided an overall SP rating for each plane: AB if the results were abnormal at 2 or more of the frequencies tested; otherwise, an overall rating of MA, GN, or N was assigned at the respective clinician's discretion. When the 3 clinicians classified SP videos using the 4 clinical categories, fair overall agreement was demonstrated. When the MA and GN categories were combined into a single SUBC category, overall agreement among the 3 clinicians improved slightly for both horizontal SP (HSP) and vertical SP (VSP).
This pattern of agreement did not differ considerably between HSP and VSP, and good consistency and reliability were observed across clinicians. Inter-rater consistency was again lower for VSP than for HSP despite the reduction in clinical categories. Cut-off values generated for the PS metric demonstrated good sensitivity and specificity for classifying a subject's SP test when they were exceeded at 2 or more frequencies in a given plane.
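The PS decision rule described above (flag a plane when the cut-off is exceeded at 2 or more frequencies) and the way sensitivity and specificity follow from it can be illustrated with a minimal sketch. The cut-off values below are hypothetical placeholders, not the values generated in this study, and the function names `flag_abnormal` and `sensitivity_specificity` are illustrative assumptions only; the clinician rating is assumed here to serve as the reference classification.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL percent-saccade cut-offs per test frequency (Hz).
# These are placeholders for illustration, not the study's cut-offs.
EXAMPLE_CUTOFFS: Dict[float, float] = {0.1: 20.0, 0.3: 30.0, 0.5: 40.0}


def flag_abnormal(
    ps_by_freq: Dict[float, float],
    cutoffs: Dict[float, float] = EXAMPLE_CUTOFFS,
    min_exceeded: int = 2,
) -> bool:
    """Flag one plane as abnormal when PS exceeds its cut-off
    at `min_exceeded` or more of the tested frequencies."""
    exceeded = sum(1 for freq, ps in ps_by_freq.items() if ps > cutoffs[freq])
    return exceeded >= min_exceeded


def sensitivity_specificity(
    predicted: List[bool], reference: List[bool]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) of the PS flag against a
    reference classification (True = abnormal)."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    tn = sum((not p) and (not r) for p, r in zip(predicted, reference))
    fp = sum(p and (not r) for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Made-up PS values for one plane of one subject (not study data).
subject_ps = {0.1: 25.0, 0.3: 35.0, 0.5: 10.0}
print(flag_abnormal(subject_ps))
```

In this sketch a plane is flagged only when at least two frequency-specific cut-offs are exceeded, mirroring the 2-or-more-frequencies criterion used for the clinicians' overall AB rating.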