In recent years, autonomous vehicles have attracted considerable attention from researchers in academia and industry. These efforts span object detection and tracking in computer vision, probabilistic fusion, decision-making algorithms, and related areas. However, most current methods that map directly from image pixels to steering behavior fail to produce accurate results, especially in scenarios where lighting and weather change drastically, largely because they ignore the temporal relationship between frames. In this paper, we propose a novel end-to-end deep learning framework based on temporal and spatial attention mechanisms, which addresses both the inaccuracy of vehicle steering-angle prediction in complex environments and the difficulty of interpreting such models. First, we use a video sequence and the historical steering-angle sequence as inputs to the model, rather than a single frame. Second, we design a temporal attention mechanism to capture long- and short-term dependencies in the input visual information, and a spatial attention mechanism to locate key objects in the image and obtain their position information. This is achieved by inserting carefully designed SE-Net, ConvLSTM, and CNN layers at appropriate points in the network. Finally, we demonstrate the feasibility of the proposed model on the public Comma2k19 dataset, comparing it against current state-of-the-art methods. Experimental results show that, compared with state-of-the-art methods, the mean absolute error (MAE) of our model on the training and testing sets is reduced by 10.2% and 6.3%, respectively, yielding more accurate steering prediction. In addition, we explain what triggers the predicted steering behavior by visualizing the spatial attention maps and temporal attention scores on the Comma2k19 and Udacity datasets, further demonstrating that the proposed model learns human-like driving behavior.
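
For concreteness, the following is a minimal PyTorch sketch of the spatio-temporal attention scheme the abstract describes: a squeeze-and-excitation (SE) channel-attention block inside the per-frame CNN, and an attention-weighted recurrent head over the sequence of frame features and historical steering angles. All module names, layer sizes, the fusion scheme, and the use of a plain LSTM in place of ConvLSTM are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the spatio-temporal attention idea; all names and
# hyperparameters are assumptions, and a plain LSTM stands in for ConvLSTM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention over per-frame features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: global average pool
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))
        return x * w[:, :, None, None]         # excite: reweight channels


class SteeringNet(nn.Module):
    """Video clip + steering history -> steering angle, with temporal attention."""
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            SEBlock(feat_dim),
            nn.AdaptiveAvgPool2d(1),           # (B*T, feat_dim, 1, 1)
        )
        # Recurrence over per-frame features concatenated with the past angle.
        self.lstm = nn.LSTM(feat_dim + 1, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)       # temporal attention scores
        self.head = nn.Linear(hidden, 1)       # steering-angle regressor

    def forward(self, frames, angles):
        # frames: (B, T, 3, H, W); angles: (B, T) historical steering angles
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)    # (B*T, feat_dim)
        feats = feats.view(B, T, -1)
        x = torch.cat([feats, angles.unsqueeze(-1)], dim=-1)
        h, _ = self.lstm(x)                                  # (B, T, hidden)
        a = torch.softmax(self.attn(h).squeeze(-1), dim=1)   # (B, T) scores
        ctx = (a.unsqueeze(-1) * h).sum(dim=1)               # weighted context
        return self.head(ctx).squeeze(-1), a                 # angle + scores


model = SteeringNet()
frames = torch.randn(2, 8, 3, 64, 64)          # batch of two 8-frame clips
angles = torch.randn(2, 8)                     # matching steering history
pred, attn = model(frames, angles)
print(pred.shape, attn.shape)                  # torch.Size([2]) torch.Size([2, 8])
```

The returned temporal attention scores (and the SE channel weights) are the quantities one would visualize to interpret which frames and features drive a given steering prediction, in the spirit of the attention-map analysis described above.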