Moving small target detection is a compelling yet highly challenging research topic in the field of object detection. Its primary difficulties lie in (i) the very small proportion of the target region within the background image, (ii) the weak contrast between target and background, and (iii) the perception of slight target motion. Existing detection methods mainly exploit image features from the spatial domain, which are weak for small targets, so additional types of features need to be considered. In view of this, besides traditional image cues, motion features are attracting increasing attention in small target detection. To improve detection performance, this paper proposes a Spatio-Temporal Fusion Network (STFNet) with two parallel feature extraction branches for detecting moving small targets. In this network, one branch is designed to capture traditional semantic features in the spatial domain, while the other is devised to perceive the slight target motion hidden in the temporal domain. Meanwhile, to optimize the extraction of spatio-temporal features, a group of motion masks is specially designed to guide the feature extractors to attend more precisely to the positions where small targets appear. Moreover, we design a new bridging module to enhance the cross-domain and cross-scale fusion of spatio-temporal features. Extensive comparison and ablation experiments on seven sub-datasets demonstrate that the proposed STFNet is effective and clearly superior to the compared methods in detecting moving small targets from satellite sequence images. It achieves a precision of 0.93, an mAP50 of 71.10%, and an F1 score of 0.84, evidently higher than existing state-of-the-art methods. Our code is available at https://github.com/UESTC-nnLab/STF.
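
To make the two-branch idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration of how a spatial branch, a temporal branch, a motion mask, and a fusion step could be wired together. The module names (TwoBranchFusionSketch, SpatialBranch-style layers, the frame-difference mask) and all layer choices are assumptions made for illustration only and are not the authors' STFNet implementation; refer to the linked repository for the actual code.

```python
# Illustrative sketch only -- NOT the authors' STFNet implementation.
# Assumes an input clip of shape (B, T, C, H, W) and shows one possible
# wiring of a spatial branch, a temporal branch, a motion-mask gate,
# and a simple fusion step.
import torch
import torch.nn as nn


class TwoBranchFusionSketch(nn.Module):
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        # Spatial branch: 2D convolutions over the current (last) frame.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Temporal branch: 3D convolutions over the whole clip to pick up motion.
        self.temporal = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, (3, 3, 3), padding=(1, 1, 1)), nn.ReLU(),
            nn.Conv3d(feat_ch, feat_ch, (3, 3, 3), padding=(1, 1, 1)), nn.ReLU(),
        )
        # Fusion step: 1x1 convolution over the concatenated branch features.
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, 1)

    def forward(self, clip):
        # clip: (B, T, C, H, W); a crude frame-difference mask stands in for
        # the learned motion masks described in the paper.
        last_frame = clip[:, -1]                             # (B, C, H, W)
        motion_mask = (clip[:, -1] - clip[:, -2]).abs().mean(1, keepdim=True)
        motion_mask = torch.sigmoid(motion_mask)             # (B, 1, H, W) soft gate

        spat = self.spatial(last_frame)                      # (B, F, H, W)
        temp = self.temporal(clip.transpose(1, 2))           # (B, F, T, H, W)
        temp = temp.mean(dim=2)                              # collapse time: (B, F, H, W)

        # The mask steers both branches toward positions with apparent motion.
        spat = spat * motion_mask
        temp = temp * motion_mask
        return self.fuse(torch.cat([spat, temp], dim=1))


if __name__ == "__main__":
    dummy_clip = torch.randn(2, 5, 3, 64, 64)                # batch of 5-frame clips
    out = TwoBranchFusionSketch()(dummy_clip)
    print(out.shape)                                         # torch.Size([2, 32, 64, 64])
```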