Visual localization and odometry for autonomous robots and virtual reality have been researched extensively over the past decades. Traditionally, this problem has been solved with the help of expensive sensors, such as light detection and ranging (LiDAR). Nowadays, leading research in this field focuses on robust localization using more economical sensors, such as cameras and inertial measurement units (IMUs). Consequently, geometric visual localization methods have become more accurate over time. However, these methods still suffer from significant loss and divergence in challenging environments, such as a room full of moving people. Researchers have started using deep neural networks (DNNs) to mitigate this problem. The main idea behind using DNNs is to better understand challenging aspects of the data and to overcome complex conditions, such as a dynamic object moving in front of the camera and covering its full view, extreme lighting conditions, and high camera speed. Prior end-to-end DNN methods have overcome some of these challenges. However, no general and robust framework is available that overcomes all of them together. In this article, we combine geometric and DNN-based methods to retain the generality and speed of geometric simultaneous localization and mapping (SLAM) frameworks while overcoming most of these challenging conditions with the help of DNNs, delivering the most robust framework so far. To do so, we have designed a framework based on VINS-Mono and shown that it achieves state-of-the-art results on the TUM-Dynamic, TUM-VI, ADVIO, and EuRoC datasets compared with geometric and end-to-end DNN-based SLAM methods. Our proposed framework also achieves outstanding results on extreme simulated cases resembling the aforementioned challenges.