In interior interaction design, achieving intelligent user-interior interaction is contingent upon understanding the user’s emotional responses. Precise identification of the user’s visual emotions holds paramount importance. Current visual emotion recognition methods rely solely on singular features, predominantly facial expressions, resulting in inadequate coverage of visual characteristics and low recognition rates. This study introduces a deep learning-based multimodal weighting network model to address this challenge. The model initiates with a convolutional attention module, employing a self-attention mechanism within a convolutional neural network (CNN). As a result, the multimodal weighting network model is integrated to optimize weights during training. Finally, a weight network classifier is derived from these optimized weights to facilitate visual emotion recognition. Experimental outcomes reveal a 77.057% correctness rate and a 74.75% accuracy rate in visual emotion recognition. Comparative analysis against existing models demonstrates the superiority of the multimodal weight network model, showcasing its potential to enhance human-centric and intelligent indoor interaction design.