Many deep neural network-based methods have recently been proposed for object detection, driven by the significant success of deep learning in computer vision. However, existing object detection methods typically extract the appearance features of objects from a single image, so they usually perform poorly when detecting micro Unmanned Aerial Vehicles (UAVs), which lack rich color, shape, and texture information. To address this issue, we introduce the temporal information of objects from videos and develop a Spatio-Temporal Attention Module (STAM) that efficiently enhances feature map extraction for micro UAV detection; we then integrate STAM into YOLOX to build a video object detector for micro UAVs. We also propose a lightweight Spatial Pyramid Pooling (SPP) module, termed Group Simplified Spatial Pyramid Pooling-Fast with Cross Stage Partial (Group SimSPPFCSP), for the final stage of the backbone to extract more semantic information at low computational cost, and a neck with rich propagation pathways (NRPP) to facilitate the effective propagation of spatial and temporal information across different levels. Furthermore, we propose two data augmentation operations, SeqMosaic and SeqMixUp, to augment video data for video object detection. Experimental results show that our model achieves competitive precision (with improvements of 5.0 mAP and 8.1 mAPSmall) while maintaining real-time inference speed (35.3 fps).