Accurate dynamic tracking of the driver's head pose is important for driver–automotive collaboration, intelligent copilots, head-up displays (HUDs), and other human-centered automated driving applications. To advance this technology, this article proposes a low-cost, markerless head-tracking system built on a deep learning-based dynamic head pose estimation model. The proposed system requires only a red, green, blue (RGB) camera, with no other hardware or markers. To improve the accuracy of driver head pose estimation, a spatiotemporal vision transformer (ST-ViT) model is proposed that takes an image pair, rather than a single frame, as its input. Unlike a standard transformer, the ST-ViT combines a spatial–convolutional vision transformer with a temporal transformer to improve model performance. To handle fluctuations in the error of the head pose estimation model, this article also proposes an adaptive Kalman filter (AKF). Based on an analysis of the error distribution of the estimation model and the user experience of the head tracker, the proposed AKF includes an adaptive observation noise coefficient that moderates the smoothness of the output curve. Comprehensive experiments show that the proposed system is feasible and effective and that it achieves state-of-the-art performance.
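The idea of an adaptive observation noise coefficient can be illustrated with a minimal scalar sketch. The abstract does not give the AKF's adaptation rule, so the rule below is purely an assumption for illustration: the observation noise is scaled by a coefficient that stays near 1 for small innovations (treated as estimator jitter, so the output is smoothed heavily) and shrinks for large innovations (treated as real head motion, so the output follows the measurement closely). The class name `AdaptiveKalman1D`, its parameters `q`, `r_base`, and `scale`, and the method `noise_coefficient` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKalman1D:
    """Scalar constant-position Kalman filter with adaptive observation noise.

    Every parameter and the adaptation rule are illustrative assumptions,
    not the rule used in the article.
    """
    q: float = 1e-2       # process noise variance (assumed value)
    r_base: float = 1.0   # baseline observation noise variance (assumed value)
    scale: float = 5.0    # innovation size, in degrees, treated as real motion (assumed)
    x: float = 0.0        # state estimate, e.g. a yaw angle in degrees
    p: float = 1.0        # variance of the state estimate

    def noise_coefficient(self, innovation: float) -> float:
        # Hypothetical rule: 1.0 at zero innovation, decaying toward 0
        # as the innovation grows relative to `scale`.
        return 1.0 / (1.0 + (innovation / self.scale) ** 2)

    def update(self, z: float) -> float:
        # Predict: the state is assumed constant, so only the variance grows.
        p_pred = self.p + self.q
        # Innovation: difference between the measurement and the prediction.
        innovation = z - self.x
        # Adapt the observation noise before computing the Kalman gain.
        r = self.r_base * self.noise_coefficient(innovation)
        k = p_pred / (p_pred + r)
        # Correct the state and its variance.
        self.x = self.x + k * innovation
        self.p = (1.0 - k) * p_pred
        return self.x
```

In use, one filter instance would be kept per pose angle (for example yaw, pitch, and roll), and `update` would be called once per frame with that angle's raw model output. The parameter values would need to be tuned against the estimator's actual error distribution.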