Ego-motion estimation plays a critical role in autonomous driving systems by providing accurate and timely information about the vehicle's position and orientation. To achieve high accuracy and robustness, it is essential to leverage a range of sensor modalities to cope with highly dynamic and diverse scenes and the resulting sensor limitations. In this work, we introduce TEFu-Net, a Deep-Learning-based late-fusion architecture that combines multiple ego-motion estimates from diverse data modalities, including stereo RGB, LiDAR point clouds, and GNSS/IMU measurements. Our approach is non-parametric and scalable, making it adaptable to different sensor configurations. By leveraging a Long Short-Term Memory (LSTM), TEFu-Net produces reliable and robust spatiotemporal ego-motion estimates. This capability allows it to filter out erroneous input measurements, ensuring the accuracy of the vehicle's motion estimates over time. Extensive experiments show an average accuracy increase of 63% over TEFu-Net's input estimators and results on par with the state of the art in real-world driving scenarios. We also demonstrate that our solution achieves accurate estimates under sensor or input failure. TEFu-Net therefore enhances the accuracy and robustness of ego-motion estimation in real-world driving scenarios, particularly in challenging conditions such as cluttered environments, tunnels, dense vegetation, and unstructured scenes, thereby bolstering the reliability of autonomous driving functions.
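To make the late-fusion idea concrete, the sketch below shows one way an LSTM could combine per-modality 6-DoF ego-motion estimates into a single spatiotemporal estimate. This is a minimal illustrative sketch, not the authors' implementation: the module name (`LateFusionLSTM`), tensor layout, pose parameterization, and dimensions are assumptions made for the example.

```python
# Minimal sketch (assumed, not TEFu-Net's actual code): an LSTM-based late-fusion
# head that combines per-modality 6-DoF ego-motion estimates (e.g., from stereo
# visual odometry, LiDAR odometry, and GNSS/IMU) into one fused estimate per step.
import torch
import torch.nn as nn


class LateFusionLSTM(nn.Module):
    def __init__(self, num_modalities: int = 3, pose_dim: int = 6, hidden_dim: int = 128):
        super().__init__()
        # Each time step receives the stacked estimates from all modalities.
        self.lstm = nn.LSTM(
            input_size=num_modalities * pose_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
        )
        # Regress a single fused 6-DoF increment (translation + rotation) per step.
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, estimates: torch.Tensor) -> torch.Tensor:
        # estimates: (batch, time, num_modalities * pose_dim)
        features, _ = self.lstm(estimates)
        return self.head(features)


if __name__ == "__main__":
    # Toy usage: 4 sequences of 10 steps, 3 modalities with 6-DoF estimates each.
    model = LateFusionLSTM()
    x = torch.randn(4, 10, 3 * 6)
    fused = model(x)  # (4, 10, 6) fused ego-motion per time step
    print(fused.shape)
```

Because the fusion operates on pose estimates rather than raw sensor data, the recurrent state can, in principle, down-weight a modality whose estimates become inconsistent over time, which is how a late-fusion design can tolerate sensor or input failure.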