In recent years, object detection in drone videos has garnered widespread attention. Conventional image-based detectors have hit a performance bottleneck on drone imagery due to the severe appearance deterioration of targets and heavy background clutter. Video object detection (VOD) is a promising paradigm for such problems, as it can effectively enhance target appearance by aggregating features from nearby frames in drone videos. However, VOD methods struggle with the feature-alignment problem, which is particularly important for drone images, where the background varies frequently. Following the VOD philosophy, we propose the Recurrent Motion Attention Network (RMA-Net) to extract spatial-temporal image features for object detection in drone vision. A recurrent scheme is designed to learn motion features over long sequences effectively and efficiently, requiring only a few frames of optical flow for initialization. Experiments demonstrate that the proposed method can readily upgrade existing object detectors to high performance on drone videos. Specifically, our RMA-adapted YOLOv7 outperforms state-of-the-art methods by a large margin on the challenging VisDrone2019-VID dataset, not only boosting mean average precision by 6.99% but also reaching a batch-1 inference speed of 31 frames per second.
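The abstract does not give implementation details, so the following is only a rough sketch of the general idea it describes: a recurrent memory of motion-aligned features that is warped by explicit optical flow for the first few frames and then propagated recurrently. All names (`warp`, `recurrent_aggregate`, the nearest-neighbor warping, and the fixed blending weight `alpha`) are illustrative assumptions, not the paper's actual RMA module.

```python
import numpy as np

def warp(feat, flow):
    # Hypothetical nearest-neighbor warp of a feature map (H, W, C)
    # by a backward flow field (H, W, 2); real systems would use
    # bilinear sampling instead.
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, W - 1)
    return feat[src_y, src_x]

def recurrent_aggregate(feats, init_flows, alpha=0.5):
    """Recurrently fuse per-frame features into a running memory.

    Only the first len(init_flows) transitions use explicit optical
    flow to align the memory; later frames reuse the recurrent state
    directly, standing in for a learned motion-attention update.
    """
    memory = feats[0]
    for t in range(1, len(feats)):
        if t - 1 < len(init_flows):
            aligned = warp(memory, init_flows[t - 1])
        else:
            aligned = memory
        memory = alpha * aligned + (1.0 - alpha) * feats[t]
    return memory
```

Under this sketch, per-frame detection would run on `memory` instead of the raw frame feature, so appearance evidence accumulates across the sequence while flow is computed only for the initialization frames.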