Recent developments such as online navigation, augmented reality (AR), virtual reality (VR), and digital twins, together with strong user demand for location-based services (LBS), have driven active research on location estimation techniques. Demand is growing in particular for effective localization in places where the Global Positioning System (GPS) is of limited use, especially very large indoor spaces. This paper proposes a new indoor localization architecture, the Fi-Vi system, in which wireless fingerprinting and visual-based positioning are combined hierarchically. The scheme consists of two steps: fingerprint-based localization (FBL) followed by visual-based localization (VBL). In the first step (the FBL stage), the entire localization area, which may be very large, is divided into multiple regions whose size and number depend on the target accuracy of this step. A machine-learning (ML) or deep-learning (DL) model trained on a Wi-Fi fingerprint radio map then selects suitable candidate regions from among them. In the second step (the VBL stage), the final location is estimated precisely by visual-based positioning, using the candidate-region information received from the FBL stage. The FBL stage uses a sparse radio map (SRM) for fingerprinting, which can be constructed with relatively little effort and cost compared to the radio maps used in conventional fingerprinting methods. As a result, the FBL stage can be combined with existing visual-based positioning methods with almost negligible implementation complexity.
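The two-step flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-fingerprint ranking stands in for the ML/DL model, a nearest-descriptor search stands in for the visual-based positioning method, and every region id, RSSI value, descriptor, and function name below is a hypothetical placeholder.

```python
import math

# Hypothetical sparse radio map: one reference RSSI vector (dBm) per region.
# Region ids, the number of access points, and all values are made up.
SPARSE_RADIO_MAP = {
    "R0": [-40.0, -70.0, -90.0],
    "R1": [-65.0, -45.0, -85.0],
    "R2": [-88.0, -60.0, -50.0],
    "R3": [-92.0, -85.0, -42.0],
}

# Hypothetical visual database: region -> list of (position, image descriptor).
VISUAL_DB = {
    "R0": [((1.0, 1.0), [0.9, 0.1]), ((3.0, 1.0), [0.8, 0.3])],
    "R1": [((11.0, 1.0), [0.2, 0.9]), ((13.0, 2.0), [0.1, 0.7])],
    "R2": [((1.0, 11.0), [0.5, 0.5]), ((3.0, 12.0), [0.6, 0.4])],
    "R3": [((11.0, 11.0), [0.3, 0.2]), ((13.0, 12.0), [0.4, 0.1])],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fbl_candidates(rssi, radio_map, k=2):
    """FBL stand-in: rank regions by fingerprint distance and keep the top k."""
    ranked = sorted(radio_map, key=lambda region: euclidean(rssi, radio_map[region]))
    return ranked[:k]


def vbl_locate(descriptor, visual_db, regions):
    """VBL stand-in: match the query only against images in the given regions.

    Returns the position of the best match and the number of comparisons made.
    """
    best_pos, best_dist, comparisons = None, float("inf"), 0
    for region in regions:
        for pos, ref in visual_db[region]:
            comparisons += 1
            dist = euclidean(descriptor, ref)
            if dist < best_dist:
                best_pos, best_dist = pos, dist
    return best_pos, comparisons


def fi_vi_locate(rssi, descriptor, k=2):
    """Run the FBL stage, then the VBL stage over the candidate regions only."""
    candidates = fbl_candidates(rssi, SPARSE_RADIO_MAP, k)
    position, comparisons = vbl_locate(descriptor, VISUAL_DB, candidates)
    return candidates, position, comparisons


if __name__ == "__main__":
    print(fi_vi_locate([-66.0, -47.0, -83.0], [0.15, 0.75]))
```

In this toy setting the visual search touches only the images of the two candidate regions rather than the whole database, which is the source of the computational saving the hierarchical structure is meant to provide. The choice of `k` trades candidate-region accuracy against the size of the visual search.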
Owing to its hierarchical structure and the SRM, the proposed scheme significantly reduces the computational load and time required for indoor localization compared to using an existing visual-based indoor positioning method alone. It also provides high accuracy and robustness even in dynamically changing indoor wireless environments, where conventional wireless fingerprinting methods degrade significantly. The performance of the proposed scheme was analyzed using the UJIIndoorLoc dataset. Experiments and theoretical analysis show that when the FBL stage achieved a candidate-region estimation accuracy of about 99% on the test dataset, the average computational load of the VBL stage for final position estimation was only about 16% of that required when the visual-based positioning method was used alone.
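One simple way to reason about how candidate-region accuracy and computational load interact is a back-of-the-envelope cost model. The sketch below is not the paper's theoretical analysis. It assumes, purely for illustration, that the VBL cost scales linearly with the number of regions searched and that a miss in the FBL stage triggers a fallback search over the remaining regions, so that the total cost equals that of a full visual search. The function name and all parameter values are hypothetical.

```python
def expected_vbl_cost_fraction(n_regions, n_candidates, hit_rate):
    """Expected VBL cost relative to a full visual search (full search = 1.0).

    Assumptions (illustrative, not taken from the paper):
      * cost is proportional to the number of regions searched;
      * if the true region is not among the candidates, the remaining
        regions are searched as well, for a total cost of 1.0.
    """
    if not 0 < n_candidates <= n_regions:
        raise ValueError("need 0 < n_candidates <= n_regions")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    candidate_cost = n_candidates / n_regions  # search the candidates only
    fallback_cost = 1.0                        # candidates plus all other regions
    return hit_rate * candidate_cost + (1.0 - hit_rate) * fallback_cost


if __name__ == "__main__":
    # Hypothetical setting: 10 regions, 2 candidates, 95% candidate accuracy.
    print(expected_vbl_cost_fraction(10, 2, 0.95))
```

With the hypothetical values above, the expected cost is about 0.24 of a full search. Under this model the saving is governed mainly by the ratio of candidate regions to total regions, while the candidate-region accuracy determines how often the fallback cost is paid.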