Multimodal sentiment analysis models can determine users' sentiments by exploiting rich information from multiple sources (e.g., textual, visual, and audio). However, deploying such models in real-world environments faces two key challenges: (1) reliance on automatic speech recognition (ASR) introduces errors in recognizing sentiment words, which can mislead the sentiment analysis of the textual modality, and (2) variations in information density across modalities complicate the design of a high-quality fusion framework. To address these challenges, this paper proposes a novel framework that combines a Multimodal Sentiment Word Optimization Module with Heterogeneous Hierarchical Fusion (MSWOHHF). Specifically, the Multimodal Sentiment Word Optimization Module refines the sentiment words extracted from the ASR-generated text, thereby reducing sentiment word recognition errors. In the multimodal fusion phase, a heterogeneous hierarchical fusion network architecture is introduced: a Transformer Aggregation Module first fuses the visual and audio modalities, enhancing the high-level semantic features of each; a Cross-Attention Fusion Module then integrates the textual modality with the fused audio-visual representation; and a Feature-Based Attention Fusion Module subsequently combines the fused and unimodal representations by dynamically adjusting their weights. Sentiment polarity is finally predicted with a nonlinear neural network. Experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets show that MSWOHHF outperforms several baselines.
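To make the hierarchical fusion pipeline concrete, the sketch below shows one possible reading of the three fusion stages in PyTorch. It is a minimal illustration only: the module names, feature dimensions, pooling strategy, and the choice of text as the cross-attention query are assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the heterogeneous hierarchical fusion described above.
# Assumes pre-extracted, pre-projected unimodal features; all hyperparameters
# and design details here are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalFusionSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        # Transformer Aggregation Module: jointly encodes visual + audio tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.av_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Cross-Attention Fusion Module: text queries attend to audio-visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feature-Based Attention Fusion: learns scalar weights over the fused
        # and unimodal (pooled) representations.
        self.feat_attn = nn.Linear(d_model, 1)
        # Nonlinear head predicting a sentiment polarity score.
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, text, visual, audio):
        # text/visual/audio: (batch, seq_len, d_model) feature sequences.
        av = self.av_encoder(torch.cat([visual, audio], dim=1))    # audio-visual fusion
        fused, _ = self.cross_attn(query=text, key=av, value=av)   # text / AV fusion
        # Pool each stream to one vector, then weight the streams dynamically.
        pooled = torch.stack([fused.mean(1), text.mean(1),
                              visual.mean(1), audio.mean(1)], dim=1)  # (B, 4, d)
        w = torch.softmax(self.feat_attn(pooled), dim=1)              # (B, 4, 1)
        rep = (w * pooled).sum(dim=1)                                 # weighted combination
        return self.head(rep)                                         # polarity score

# Example with random features standing in for text/visual/audio sequences.
model = HierarchicalFusionSketch()
out = model(torch.randn(2, 20, 128), torch.randn(2, 30, 128), torch.randn(2, 30, 128))
print(out.shape)  # torch.Size([2, 1])
```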