The two-stream network architecture can capture temporal and spatial features from videos simultaneously and has achieved excellent performance on video action recognition tasks. However, videos contain a fair amount of redundant information in both the temporal and spatial dimensions, which increases the complexity of network learning. To address this problem, we propose the residual spatial-temporal attention network (R-STAN), a feed-forward convolutional neural network that combines residual learning with a spatial-temporal attention mechanism for video action recognition, making the network focus on discriminative temporal and spatial features. In R-STAN, each stream is constructed by stacking residual spatial-temporal attention blocks (R-STABs). The spatial-temporal attention modules integrated into the residual blocks generate attention-aware features along the temporal and spatial dimensions, which largely reduces the redundant information. Combined with the properties of residual learning, this allows us to construct a very deep network for learning spatial-temporal information in videos. As the layers go deeper, the attention-aware features produced by different R-STABs adapt accordingly. We validate R-STAN through extensive experiments on the UCF101 and HMDB51 datasets, which show that combining residual learning with a spatial-temporal attention mechanism contributes substantially to video action recognition performance.
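
To make the building block concrete, below is a minimal PyTorch sketch of a residual block whose transformation is modulated by a learned spatial-temporal attention mask before the identity shortcut is added. The `RSTAB` class name, the sigmoid-gated attention branch, and all layer and parameter choices are illustrative assumptions for this sketch, not the paper's exact design.

```python
# Minimal sketch of a residual spatial-temporal attention block.
# The sigmoid-gated mask and all layer choices are illustrative
# assumptions, not the exact formulation from the paper.
import torch
import torch.nn as nn

class RSTAB(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Trunk branch: ordinary residual transformation over (T, H, W).
        self.trunk = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )
        # Attention branch: a mask in [0, 1] over the same temporal
        # and spatial positions as the trunk features.
        self.attention = nn.Sequential(
            nn.Conv3d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, T, H, W)
        f = self.trunk(x)
        mask = self.attention(f)
        # Attention-aware features plus the identity shortcut, so the
        # mask suppresses redundant positions while the residual path
        # keeps gradients flowing through a deep stack of blocks.
        return self.relu(x + mask * f)

# Usage: one block applied to a batch of 8-frame feature clips.
block = RSTAB(64)
clip = torch.randn(2, 64, 8, 56, 56)
out = block(clip)  # same shape: (2, 64, 8, 56, 56)
```

Because the attention output is added to the identity shortcut rather than replacing it, each block can only refine its input, which is what lets many R-STABs be stacked into a very deep stream without degrading optimization.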