Abstract To address the insufficient accuracy and robustness of current lightweight object detection networks in low-resolution thermal infrared face detection, this paper develops an ultra-lightweight thermal infrared face detection algorithm based on a visual attention mechanism. Comparative experiments are conducted to determine a suitable network complexity. Taking Yolo-FastestDet as the baseline, the backbone network is compressed to balance network depth against detection speed. To strengthen deep feature extraction and the discrimination of target edges and small objects, a Transformer-Encoder-based visual attention module is integrated into the network, yielding a lightweight face detection algorithm with attention. To mitigate the scarcity of low-resolution infrared face image data, a self-constructed visible-infrared face dataset is used for training and evaluation. Experimental results show that the proposed algorithm achieves an mAP@0.5 of 0.953 on the test set while meeting the real-time detection requirement of 30 frames per second (FPS) on an embedded Raspberry Pi CPU.
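As a reading aid for the two reported figures, the following minimal sketch illustrates what they usually mean. It assumes the common convention that mAP@0.5 counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5, and that 30 FPS corresponds to a per-frame time budget of 1/30 s. The abstract does not spell out the evaluation protocol, so the function names and the box format `(x1, y1, x2, y2)` are hypothetical and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Assumed matching rule behind mAP@0.5: IoU at or above the threshold."""
    return iou(pred_box, gt_box) >= threshold


def frame_budget_ms(fps):
    """Maximum time per frame, in milliseconds, to sustain a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Illustrative boxes only, not data from the paper.
    print(is_true_positive((0, 0, 10, 10), (2, 0, 12, 10)))
    print(round(frame_budget_ms(30), 1))
```

Under these assumptions, a detector that sustains 30 FPS must finish preprocessing, inference, and post-processing for each frame in roughly 33 ms.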