Satellite video scene classification (SVSC) is an advanced topic in the remote sensing field that refers to determining scene categories from satellite videos. SVSC is an important and fundamental step in satellite video analysis and understanding, providing priors for the presence of objects and dynamic events. In this paper, a two-stage framework is proposed to extract spatial features and motion features for SVSC. More specifically, the first stage is designed to extract spatial features from satellite videos. Representative frames are first selected based on blur detection and the spatial activity of the satellite videos. Then a fine-tuned visual geometry group network (VGG-Net) is transferred to extract spatial features from the spatial content. The second stage is designed to build a motion representation for satellite videos. First, the motion of moving targets in satellite videos is represented by the second temporal principal component of principal component analysis (PCA). Second, features from the first fully connected layer of VGG-Net are used as a high-level spatial representation of the moving targets. Third, a small long short-term memory (LSTM) network is further designed to encode temporal information. The two-stage features respectively characterize the spatial and temporal patterns of satellite scenes and are finally fused for SVSC. A satellite video dataset is built for video scene classification, containing 7209 video segments and covering 8 scene categories. These satellite videos come from the Jilin-1 satellites and UrtheCast. The experimental results demonstrate the effectiveness of the proposed framework for SVSC.
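The first stage selects representative frames by combining blur detection with a spatial-activity measure. A minimal numpy sketch of that idea is below; the scoring functions (`blur_score` via variance of a Laplacian response, `spatial_activity` via mean gradient magnitude) and the product-based ranking are illustrative assumptions, not the paper's exact criteria.

```python
import numpy as np

def blur_score(frame):
    """Variance of a 5-point Laplacian response; lower means blurrier.
    A common blur-detection proxy, assumed here for illustration."""
    lap = (-4 * frame[1:-1, 1:-1]
           + frame[:-2, 1:-1] + frame[2:, 1:-1]
           + frame[1:-1, :-2] + frame[1:-1, 2:])
    return lap.var()

def spatial_activity(frame):
    """Mean gradient magnitude as a simple spatial-activity measure."""
    gy, gx = np.gradient(frame.astype(np.float64))
    return np.hypot(gx, gy).mean()

def select_representative(frames, k=3):
    """Rank frames by sharpness * activity and return the top-k indices.
    Hypothetical combination rule; the paper may weight the two cues differently."""
    scores = [blur_score(f) * spatial_activity(f) for f in frames]
    return np.argsort(scores)[::-1][:k]
```

The selected frames would then be fed to the fine-tuned VGG-Net for spatial feature extraction.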
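The second stage's motion representation rests on temporal PCA: in a satellite video the dominant (first) temporal component captures the nearly static background, so the second component tends to highlight moving targets. A sketch of that computation, assuming grayscale frames stacked as a `(T, H, W)` array (the function name and back-projection step are illustrative, not the paper's exact formulation):

```python
import numpy as np

def second_temporal_pc(frames):
    """Project a video clip onto its second temporal principal component
    to obtain a spatial map that emphasizes moving targets.

    frames: (T, H, W) grayscale array. Returns an (H, W) motion map.
    """
    t, h, w = frames.shape
    x = frames.reshape(t, h * w).astype(np.float64)
    x -= x.mean(axis=0, keepdims=True)      # center each pixel's time series
    cov = x @ x.T / (h * w)                 # small T x T temporal covariance
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    pc2 = vecs[:, -2]                       # second temporal principal component
    # Back-project the component onto the pixels to form a motion map
    return (pc2 @ x).reshape(h, w)
```

Working in the T x T temporal covariance (rather than the huge pixel-by-pixel covariance) keeps the eigendecomposition cheap, since satellite video clips have far fewer frames than pixels.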