Indoor location-based services constitute an important part of our daily lives, providing position and direction information about people or objects in indoor spaces. These systems can be useful in security and monitoring applications that target specific areas such as rooms. Vision-based scene recognition is the task of accurately identifying a room category from a given image. Despite years of research in this field, scene recognition remains an open problem because real-world places are diverse and complex. Indoor environments are relatively complicated because of layout variability, object and decoration complexity, and multiscale and viewpoint changes. In this paper, we propose a room-level indoor localization system based on deep learning and built-in smartphone sensors, combining visual information with the smartphone's magnetic heading. The user can be localized at room level simply by capturing an image with a smartphone. The presented indoor scene recognition system is based on direction-driven convolutional neural networks (CNNs) and therefore contains multiple CNNs, each tailored for a particular range of indoor orientations. We present weighted fusion strategies that improve system performance by properly combining the outputs from the different CNN models. To meet users' needs and overcome smartphone limitations, we propose a hybrid computing strategy based on mobile computation offloading that is compatible with the proposed system architecture. The implementation of the scene recognition system is split between the user's smartphone and a server, which helps meet the computational requirements of CNNs. Several experimental analyses were conducted, including a performance assessment and a stability analysis. The results obtained on a real dataset show the relevance of the proposed approach for localization, as well as the value of model partitioning in hybrid mobile computation offloading.
Our extensive evaluation demonstrates an increase in accuracy compared to traditional CNN scene recognition, indicating the effectiveness and robustness of our approach.
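To make the direction-driven fusion idea concrete, the following minimal sketch shows one way the outputs of several orientation-specific CNNs could be combined using the magnetic heading. It is an illustration under stated assumptions, not the system's actual method: the triangular weighting rule, the sector centers and width, and all function names (`angular_distance`, `direction_weights`, `fuse`, `predict_room`) are hypothetical, and the per-model class probabilities are assumed to be already available as plain lists.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def direction_weights(heading, centers, width):
    """Assumed triangular weighting: a model whose sector center is closer
    to the measured heading receives a larger weight; weights sum to 1."""
    raw = [max(0.0, 1.0 - angular_distance(heading, c) / width) for c in centers]
    total = sum(raw)
    if total == 0.0:
        # No sector covers this heading: fall back to uniform weights.
        return [1.0 / len(centers)] * len(centers)
    return [r / total for r in raw]


def fuse(probabilities, weights):
    """Weighted sum of the per-model class probability vectors."""
    n_classes = len(probabilities[0])
    return [
        sum(w * p[j] for w, p in zip(weights, probabilities))
        for j in range(n_classes)
    ]


def predict_room(heading, centers, width, probabilities, rooms):
    """Return the room with the highest fused score, plus the fused scores."""
    weights = direction_weights(heading, centers, width)
    fused = fuse(probabilities, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return rooms[best], fused


# Hypothetical setup: four models centered on the cardinal directions.
centers = [0.0, 90.0, 180.0, 270.0]
rooms = ["office", "kitchen"]
# Placeholder softmax outputs, one row per orientation-specific model.
probabilities = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5]]

room, fused = predict_room(45.0, centers, 90.0, probabilities, rooms)
```

With a heading of 45 degrees, only the two models centered at 0 and 90 degrees contribute, each with equal weight, so the fused score is the average of their outputs. Other fusion rules (for example, learned weights) would fit the same interface.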