With the rapid development of the building industry, intelligent buildings have become widely popular thanks to their advantages in safety, energy saving, environmental protection, and system integration. Most operators recognize that intelligent buildings can deliver humanized, customized services, and multi-modal data fusion is an effective way to realize such personalized building services. At the same time, in today's Internet of Things society, many practical applications must deploy large numbers of sensing devices to collect and process data in order to monitor the physical world with high quality; however, owing to the inherent limitations of this hardware and the influence of environmental factors, single-modality data often cannot capture the changing characteristics of the physical world completely and comprehensively. Against this background, multi-modal data fusion has become a research hotspot in machine learning. Accordingly, this paper proposes a one-stage, end-to-end fast object detection model with multi-level fusion of multi-modal features for indoor building environment perception, and evaluates its performance experimentally. The results show that the proposed method achieves an accuracy of 50.7% with a running time of 0.107 s, outperforming existing detection methods.
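To make the idea of multi-level fusion of multi-modal features more concrete, the following is a minimal illustrative sketch in PyTorch. It assumes two input modalities (an RGB image and a single-channel depth map), small illustrative channel sizes, and a generic anchor-based one-stage head; these choices are assumptions for illustration only and do not reproduce the exact architecture proposed in this paper.

```python
# Illustrative sketch: two per-modality feature streams fused at every
# pyramid level, each fused level feeding a shared one-stage detection head.
# Modalities, channel sizes, and head layout are assumptions, not the
# paper's exact model.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """3x3 conv + BN + ReLU with stride 2, producing one pyramid level."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class MultiLevelFusionDetector(nn.Module):
    """Per-modality backbones whose features are concatenated and reduced
    at each level; every fused level predicts class scores and box offsets."""
    def __init__(self, num_classes=10, num_anchors=3, channels=(32, 64, 128)):
        super().__init__()
        self.rgb_stages = nn.ModuleList()
        self.depth_stages = nn.ModuleList()
        self.fuse = nn.ModuleList()
        self.heads = nn.ModuleList()
        c_rgb, c_depth = 3, 1  # RGB image and single-channel depth map
        for c in channels:
            self.rgb_stages.append(ConvBlock(c_rgb, c))
            self.depth_stages.append(ConvBlock(c_depth, c))
            self.fuse.append(nn.Conv2d(2 * c, c, 1))  # level-wise fusion
            self.heads.append(
                nn.Conv2d(c, num_anchors * (num_classes + 4), 3, padding=1)
            )
            c_rgb, c_depth = c, c

    def forward(self, rgb, depth):
        outputs = []
        x_r, x_d = rgb, depth
        for stage_r, stage_d, fuse, head in zip(
                self.rgb_stages, self.depth_stages, self.fuse, self.heads):
            x_r = stage_r(x_r)
            x_d = stage_d(x_d)
            fused = fuse(torch.cat([x_r, x_d], dim=1))  # fuse at this level
            outputs.append(head(fused))
        return outputs  # one prediction map per fused pyramid level


if __name__ == "__main__":
    model = MultiLevelFusionDetector()
    rgb = torch.randn(1, 3, 256, 256)
    depth = torch.randn(1, 1, 256, 256)
    for level, out in enumerate(model(rgb, depth)):
        print(level, out.shape)
```

The general design point this sketch illustrates is that fusing the modalities at every pyramid level, rather than only at the raw input or only at the final feature map, lets a one-stage detector combine modality-specific cues at several spatial resolutions within a single end-to-end network.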