Abstract

In the field of autonomous vehicles, accurately predicting the steering angle and vehicle speed is a pivotal task: it affects the accuracy of the vehicle's final decisions and underpins safe and efficient operation. Previous studies have often relied on data from only one or two modalities to predict steering angle and vehicle speed, which is often insufficient. In this paper, we propose a Multi-Modal Fusion-Based End-to-End Steering Angle and Vehicle Speed Prediction Network (MFE-SSNet). The network extends the conventional one- and two-stream structures to a three-stream structure, extracting features from images, steering angles, and vehicle speeds with HRNet and LSTM layers. To fully fuse the feature information of the different modalities, we also propose a local attention-based feature fusion module, which improves the fusion of the modal feature vectors by capturing interdependencies among local channels. Experimental results demonstrate that MFE-SSNet outperforms the current state-of-the-art model on the publicly available Udacity dataset.
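
To make the three-stream design and the local attention-based fusion described above concrete, the following is a minimal PyTorch-style sketch. The abstract does not specify the actual layer configuration, so the class names (ThreeStreamNet, LocalChannelAttention), the CNN stand-in for the HRNet backbone, and all layer sizes are illustrative assumptions, not the authors' MFE-SSNet implementation.

```python
# Minimal sketch, assuming PyTorch. All names and sizes are illustrative
# placeholders; this is not the authors' actual MFE-SSNet configuration.
import torch
import torch.nn as nn


class LocalChannelAttention(nn.Module):
    """Re-weights channels of a fused feature vector with a local
    (kernel-size-k) 1D convolution across the channel axis."""
    def __init__(self, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C)
        w = torch.sigmoid(self.conv(x.unsqueeze(1))).squeeze(1)  # per-channel weights
        return x * w


class ThreeStreamNet(nn.Module):
    """Image stream (small CNN standing in for HRNet) plus two LSTM streams
    for past steering angles and speeds, fused and re-weighted before two
    regression heads."""
    def __init__(self, img_feat: int = 128, seq_feat: int = 32):
        super().__init__()
        self.image_stream = nn.Sequential(       # placeholder for an HRNet backbone
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, img_feat), nn.ReLU(),
        )
        self.angle_lstm = nn.LSTM(1, seq_feat, batch_first=True)
        self.speed_lstm = nn.LSTM(1, seq_feat, batch_first=True)
        self.fusion_attn = LocalChannelAttention()
        fused = img_feat + 2 * seq_feat
        self.angle_head = nn.Linear(fused, 1)
        self.speed_head = nn.Linear(fused, 1)

    def forward(self, image, past_angles, past_speeds):
        f_img = self.image_stream(image)                # (B, img_feat)
        f_ang = self.angle_lstm(past_angles)[0][:, -1]  # last LSTM output
        f_spd = self.speed_lstm(past_speeds)[0][:, -1]
        fused = self.fusion_attn(torch.cat([f_img, f_ang, f_spd], dim=1))
        return self.angle_head(fused), self.speed_head(fused)


if __name__ == "__main__":
    net = ThreeStreamNet()
    angle, speed = net(torch.randn(2, 3, 128, 128),   # front-camera images
                       torch.randn(2, 10, 1),         # past steering angles
                       torch.randn(2, 10, 1))         # past speeds
    print(angle.shape, speed.shape)  # torch.Size([2, 1]) torch.Size([2, 1])
```

The sketch follows the abstract's structure: one stream per modality, concatenation of the three feature vectors, and a local channel attention step (here a small 1D convolution over the channel axis, an assumption about how "local channel interdependencies" might be captured) before separate steering-angle and speed heads.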