Understanding the emotional polarity of multimodal content with metaphorical characteristics, such as memes, poses a significant challenge in Multimodal Emotion Recognition (MER). Previous MER researches have overlooked the phenomenon of metaphorical alignment in multimedia content, which involves non-literal associations between concepts to convey implicit emotional tones. Metaphor-agnostic MER methods may be misinformed by the isolated unimodal emotions, which are distinct from the real emotions blended in multimodal metaphors. Moreover, contextual semantics can further affect the emotions associated with similar metaphors, leading to the challenge of maintaining contextual compatibility. To address the issue of metaphorical alignment in MER, we propose to leverage a conditional generative approach for capturing metaphorical analogies. Our approach formulates schematic prompts and corresponding references based on theoretical foundations, which allows the model to better grasp metaphorical nuances. In order to maintain contextual sensitivity, we incorporate a disentangled contrastive matching mechanism, which undergoes curricular adjustment to regulate its intensity during the learning process. The automatic and human evaluation experiments on two benchmarks prove that, our model provides considerable and stable improvements in recognizing multimodal emotion with metaphor attributes.