With the widespread adoption of social networks, image-text comments have become a prevalent mode of emotional expression alongside traditional text-only posts. Two major challenges remain: how to effectively extract rich representations from both text and images, and how to extract the emotion features shared across modalities. This study proposes a multimodal sentiment analysis method based on a deep feature interaction network (DFINet). It leverages word-to-word graphs and a deep attention interaction network (DAIN) to learn text representations from multiple subspaces, and it introduces a cross-modal attention interaction network to extract cross-modal shared emotion features efficiently. Together, these components alleviate the difficulties of acquiring image-text features and of representing cross-modal shared emotion. Experimental results on the Yelp dataset demonstrate the effectiveness of DFINet.
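
To make the cross-modal interaction concrete, the following is a minimal PyTorch sketch of a generic cross-modal attention block of the kind described above. The class name, fusion layer, feature dimensions, and pooling are illustrative assumptions, not the authors' exact DFINet implementation.

```python
# Minimal sketch of a cross-modal attention interaction block (an assumption,
# not the exact DFINet layer): text features query image features and vice
# versa, and the two attended outputs are fused into a shared representation.
import torch
import torch.nn as nn


class CrossModalAttentionInteraction(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Text-queries-image and image-queries-text attention modules.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # hypothetical fusion layer

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, dim); image_feats: (batch, n_regions, dim)
        txt_attended, _ = self.text_to_image(text_feats, image_feats, image_feats)
        img_attended, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool each attended sequence and concatenate into a shared emotion vector.
        shared = torch.cat([txt_attended.mean(dim=1), img_attended.mean(dim=1)], dim=-1)
        return self.fuse(shared)  # (batch, dim) shared emotion representation


if __name__ == "__main__":
    block = CrossModalAttentionInteraction()
    text = torch.randn(2, 30, 256)   # e.g., 30 word embeddings per comment
    image = torch.randn(2, 49, 256)  # e.g., 7x7 grid of CNN region features
    print(block(text, image).shape)  # torch.Size([2, 256])
```

In this sketch, each modality attends to the other so that the fused vector reflects emotion cues present in both text and image; the actual DFINet interaction network may differ in structure and fusion strategy.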