Cryo-electron microscopy (cryo-EM) for single-particle analysis suffers from a low signal-to-noise ratio (SNR), so several movie frames of the same particle sample must be combined to restore one high-quality image of that particle. The low SNR of cryo-EM movies and the beam-induced motion make this task very challenging. Video enhancement algorithms from computer vision, built on deep neural networks, offer a new way to tackle such tasks, but they are designed for natural images with high SNR. Meanwhile, the lack of ground truth for cryo-EM movies appears to be a major factor limiting progress. We therefore present a synthetic cryo-EM movie generation pipeline that produces realistic, diverse cryo-EM movie datasets with low-SNR movie frames and multiple ground-truth values. We then propose a deep spatio-temporal network (DST-Net) for cryo-EM movie frame enhancement, trained on our synthetic data. Spatial and temporal features are first extracted from each frame; a spatio-temporal fusion module and a high-resolution reconstructor then produce the enhanced output. For evaluation, we train our model on seven synthetic cryo-EM movie datasets and run inference on real cryo-EM data. The experimental results show that DST-Net achieves better enhancement performance, both quantitatively and qualitatively, than the methods it is compared against.
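
To make the problem setting concrete, the following minimal sketch simulates a low-SNR movie from one clean image and shows why combining aligned frames helps. It is an illustration under simplifying assumptions, not the generation pipeline or DST-Net described above: the function name `synth_movie`, its parameters, the whole-pixel random-walk drift, the additive Gaussian noise model, and the SNR definition (signal variance divided by noise variance) are all hypothetical choices made for this example.

```python
import numpy as np

def synth_movie(clean, n_frames=8, snr=0.05, max_step=2, seed=0):
    """Return (frames, shifts): noisy, drifting copies of `clean` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Random-walk drift in whole pixels, a crude stand-in for beam-induced motion.
    steps = rng.integers(-max_step, max_step + 1, size=(n_frames, 2))
    shifts = np.cumsum(steps, axis=0)
    # Noise standard deviation chosen so that var(signal) / var(noise) == snr.
    sigma = np.sqrt(clean.var() / snr)
    frames = np.empty((n_frames,) + clean.shape, dtype=np.float64)
    for t, (dy, dx) in enumerate(shifts):
        moved = np.roll(clean, shift=(dy, dx), axis=(0, 1))  # circular shift
        frames[t] = moved + rng.normal(0.0, sigma, size=clean.shape)
    return frames, shifts

# Toy "particle": a Gaussian blob on a 64x64 grid (an assumption for the demo).
yy, xx = np.mgrid[:64, :64]
clean = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 6.0 ** 2))

frames, shifts = synth_movie(clean)

# Because the shifts are known here, frames can be aligned exactly and averaged.
aligned = np.stack([
    np.roll(f, shift=(-dy, -dx), axis=(0, 1))
    for f, (dy, dx) in zip(frames, shifts)
])
avg = aligned.mean(axis=0)

mse_single = np.mean((aligned[0] - clean) ** 2)
mse_avg = np.mean((avg - clean) ** 2)
print(f"MSE single frame: {mse_single:.4f}, MSE of aligned average: {mse_avg:.4f}")
```

In this toy setting the shifts are known, so alignment is trivial and averaging reduces the noise. In real movies the motion is unknown and must be estimated from the noisy frames themselves, which is the difficulty the abstract describes.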