RGB-D image-based indoor scene recognition is a challenging task due to complex scene layouts and cluttered objects. Although the depth modality provides extra geometric information, how to better learn multi-modal features remains an open problem. Considering this, in this paper we propose modality separation networks to extract modal-consistent and modal-specific features simultaneously. The motivations of this work are twofold: 1) to explicitly learn what is unique to each modality and what is common between the two modalities; 2) to explore the relationship between global/local features and modal-specific/modal-consistent features. To this end, the proposed framework contains two branches of submodules for learning multi-modal features. One branch extracts the individual characteristics of each modality by minimizing the similarity between the two modalities, while the other learns the information shared by the two modalities by maximizing a correlation term. Moreover, with a spatial attention module, our method can visualize the spatial positions on which different submodules focus. We evaluate our method on two public RGB-D scene recognition datasets, and the proposed framework achieves new state-of-the-art results.
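To make the two-branch objective concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: one term penalizes cross-modal similarity of the modal-specific features, and one rewards correlation of the modal-consistent features. All names (modal_specific_loss, modal_consistent_loss, feat_dim) are hypothetical, and the paper's exact similarity and correlation formulations may differ.

```python
# Hypothetical sketch of the two-branch loss design; the paper's exact
# similarity/correlation terms may differ.
import torch
import torch.nn.functional as F


def modal_specific_loss(rgb_spec: torch.Tensor, depth_spec: torch.Tensor) -> torch.Tensor:
    """Separate modal-specific features by minimizing cross-modal similarity.

    Similarity is taken here as the mean squared dot product between the
    L2-normalized RGB and depth feature batches (one possible choice).
    """
    rgb = F.normalize(rgb_spec, dim=1)
    dep = F.normalize(depth_spec, dim=1)
    return (rgb @ dep.t()).pow(2).mean()


def modal_consistent_loss(rgb_cons: torch.Tensor, depth_cons: torch.Tensor) -> torch.Tensor:
    """Align modal-consistent features by maximizing their correlation,
    implemented as minimizing negative cosine similarity."""
    return -F.cosine_similarity(rgb_cons, depth_cons, dim=1).mean()


# Usage example with random features standing in for network outputs.
batch, feat_dim = 8, 512
rgb_spec, depth_spec = torch.randn(batch, feat_dim), torch.randn(batch, feat_dim)
rgb_cons, depth_cons = torch.randn(batch, feat_dim), torch.randn(batch, feat_dim)

separation_loss = (modal_specific_loss(rgb_spec, depth_spec)
                   + modal_consistent_loss(rgb_cons, depth_cons))
```

In a full model, these two terms would be added to the standard classification loss so that the two branches are trained jointly with the recognition objective.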