Sarcasm is a form of sentiment expression that highlights the disparity between a person's true intention and the content they explicitly present. With the exponential growth of multimodal data on social platforms, multimodal sarcasm detection has become a pivotal area of research. Although previous studies have extensively examined multimodal feature extraction, fusion, and the modeling of inter-modal incongruity, they often neglect the subtle sentiment cues inherent in sarcastic multimodal data and do not adequately address the sparse distribution of, and tenuous connections between, sarcastic features both within and across modalities. To address these gaps, we introduce a hierarchical fusion model that integrates sentiment information for enhanced multimodal sarcasm detection. Specifically, we extract attribute-object pairs from the image modality and treat them as an auxiliary attribute modality. Sentiment information is then extracted from each modality and fused to obtain a more comprehensive intra-modal representation. Moreover, we model inter-modal incongruity with a cross-modal Transformer and introduce a sentiment-aware image-text contrastive loss to better align the semantics of images and text. By strengthening these alignments, the model is better equipped to capture incongruous relationships. Experiments demonstrate that our hierarchical fusion model achieves state-of-the-art performance on the multimodal sarcasm detection task.
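
To make the sentiment-aware image-text contrastive objective concrete, the following is a minimal sketch, not the authors' implementation: it assumes a symmetric InfoNCE loss over matched image-text pairs, with a hypothetical per-sample weight (`sentiment_weight`) that emphasizes pairs whose image and text sentiment scores agree. The function name, weighting scheme, and score ranges are illustrative assumptions.

```python
# Illustrative sketch (assumed formulation, not the paper's exact loss) of a
# sentiment-aware image-text contrastive loss in PyTorch.
import torch
import torch.nn.functional as F


def sentiment_aware_itc_loss(img_emb, txt_emb, img_sent, txt_sent, temperature=0.07):
    """img_emb, txt_emb: (B, D) projected embeddings of matched image-text pairs.
    img_sent, txt_sent: (B,) sentiment scores in [-1, 1] for each modality (assumed range)."""
    # Normalize embeddings so dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Similarity matrix between every image and every text in the batch.
    logits = img_emb @ txt_emb.t() / temperature          # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Symmetric InfoNCE: image-to-text and text-to-image directions, per sample.
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")

    # Sentiment-aware weighting (assumption): pairs whose image and text
    # sentiments agree receive a larger weight, since their alignment signal
    # is taken to be more reliable.
    sentiment_weight = 1.0 + 0.5 * img_sent * txt_sent    # (B,)

    return (sentiment_weight * (loss_i2t + loss_t2i) / 2).mean()


# Usage sketch with random tensors.
if __name__ == "__main__":
    B, D = 8, 256
    loss = sentiment_aware_itc_loss(
        torch.randn(B, D), torch.randn(B, D),
        torch.rand(B) * 2 - 1, torch.rand(B) * 2 - 1,
    )
    print(loss.item())
```

In this sketch the contrastive term pulls matched image-text embeddings together while pushing apart mismatched pairs within the batch; the sentiment weight is one plausible way to inject sentiment agreement into that alignment, in the spirit of the mechanism described above.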