To assess vestibular function, the video head impulse test (vHIT) is regarded as the gold standard for evaluating the vestibulo-ocular reflex (VOR). However, vHIT requires the patient to wear specialized head-mounted goggles that must be calibrated before each use. To address this, we proposed an intelligent head impulse test (iHIT) setting that replaces the head-mounted goggles with a monocular infrared camera, together with a deep-learning video classification approach for vestibular function determination. Within the iHIT framework, a monocular infrared camera was placed in front of the patient to capture test videos, from which a dataset of HIT video clips, DiHIT, was constructed. We then proposed a two-stage multi-modal video classification network, trained on DiHIT, that takes as input the eye-motion and head-motion data extracted from facial keypoints in the HIT clips and outputs both the identification of the semicircular canal (SCC) being tested (SCC identification) and the determination of VOR abnormality (SCC qualitation). Experiments on DiHIT showed that the network achieved 100% accuracy in SCC identification, and predictive accuracies of 84.1% and 79.0% in horizontal and vertical SCC qualitation, respectively. Compared with existing video-based HIT, iHIT eliminates the goggles, requires no equipment calibration, and is fully automated; its low cost and ease of operation offer further benefits to users. Code and a use-case pipeline are available at: https://github.com/dec1st2023/iHIT. Level of Evidence: 3. Laryngoscope, 2024.
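To make the two-stage, multi-modal design concrete, the sketch below shows one plausible way such a network could be organized in PyTorch: two sequence encoders for the eye-motion and head-motion streams, a fused representation, and separate heads for SCC identification and SCC qualitation. This is a minimal illustration, not the authors' released code; the feature dimensions, GRU encoders, and class counts (six canals, binary normal/abnormal VOR) are assumptions made for the example, and the actual architecture is in the linked repository.

```python
# Minimal sketch of a two-stage multi-modal classifier for iHIT-style inputs.
# NOTE: all dimensions and class counts below are illustrative assumptions,
# not values taken from the paper or its repository.
import torch
import torch.nn as nn


class IHITClassifier(nn.Module):
    def __init__(self, eye_dim=4, head_dim=3, hidden=64, n_canals=6, n_qual=2):
        super().__init__()
        # Modality-specific encoders for per-frame eye- and head-motion features.
        self.eye_enc = nn.GRU(eye_dim, hidden, batch_first=True)
        self.head_enc = nn.GRU(head_dim, hidden, batch_first=True)
        # Stage 1: which semicircular canal is being tested.
        self.scc_id_head = nn.Linear(2 * hidden, n_canals)
        # Stage 2: VOR normal vs. abnormal, conditioned on the fused features
        # plus the stage-1 prediction.
        self.scc_qual_head = nn.Linear(2 * hidden + n_canals, n_qual)

    def forward(self, eye_seq, head_seq):
        _, h_eye = self.eye_enc(eye_seq)      # final hidden state: (1, B, hidden)
        _, h_head = self.head_enc(head_seq)
        fused = torch.cat([h_eye[-1], h_head[-1]], dim=-1)   # (B, 2*hidden)
        scc_id_logits = self.scc_id_head(fused)
        qual_in = torch.cat([fused, scc_id_logits.softmax(-1)], dim=-1)
        scc_qual_logits = self.scc_qual_head(qual_in)
        return scc_id_logits, scc_qual_logits


if __name__ == "__main__":
    model = IHITClassifier()
    eye = torch.randn(8, 120, 4)    # 8 clips, 120 frames, 4 eye-motion features
    head = torch.randn(8, 120, 3)   # 3 head-motion features per frame
    id_logits, qual_logits = model(eye, head)
    print(id_logits.shape, qual_logits.shape)  # (8, 6) and (8, 2)
```

In this layout, the qualitation head sees the identification output, reflecting the two-stage structure described in the abstract (first decide which SCC is tested, then judge whether its VOR is abnormal); the published method may fuse or stage the modalities differently.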