Multimodal emotion recognition in conversations (ERC) aims to identify the emotional state of each constituent utterance expressed by multiple speakers in a dialogue from multimodal data. Existing multimodal ERC approaches focus on modeling the global context of the dialogue and neglect to mine speaker-specific characteristic information from the utterances expressed by the same speaker. Additionally, information from different modalities exhibits both commonality and diversity in emotional expression. The commonality and diversity of multimodal information complement each other, yet they are not effectively exploited in previous multimodal ERC works. To tackle these issues, we propose a novel Multimodal Adversarial Learning Network (MALN). MALN first mines the speaker's characteristics from context sequences and then incorporates them with the unimodal features. Afterward, we design a novel adversarial module, AMDM, to exploit both the commonality and the diversity of the unimodal features. Finally, AMDM fuses the different modalities to generate refined utterance representations for emotion classification. Extensive experiments are conducted on two public multimodal ERC datasets, IEMOCAP and MELD. The experimental results demonstrate the superiority of MALN over state-of-the-art methods.
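
To make the described pipeline concrete, the following is a minimal, illustrative sketch of the overall flow (speaker-characteristic mining, incorporation with unimodal features, adversarial separation of common and modality-specific parts, and fusion for classification). It is not the authors' implementation: the module names, dimensions, GRU-based speaker encoder, and the simple gradient-reversal discriminator are all assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Reverses gradients so the modality discriminator is trained adversarially."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        return -grad


class MALNSketch(nn.Module):
    """Hypothetical sketch of the MALN pipeline described in the abstract."""

    def __init__(self, dims, hidden=128, n_classes=6):
        super().__init__()
        # Speaker-characteristic encoder over the dialogue context sequence (assumed GRU).
        self.speaker_enc = nn.GRU(sum(dims.values()), hidden, batch_first=True)
        # Project each modality, concatenated with speaker features, into a shared space.
        self.proj = nn.ModuleDict({m: nn.Linear(d + hidden, hidden) for m, d in dims.items()})
        # Stand-in for AMDM: a shared "commonality" encoder, per-modality "diversity"
        # encoders, and a modality discriminator trained through gradient reversal.
        self.common = nn.Linear(hidden, hidden)
        self.private = nn.ModuleDict({m: nn.Linear(hidden, hidden) for m in dims})
        self.discriminator = nn.Linear(hidden, len(dims))  # predicts the source modality
        self.classifier = nn.Linear(2 * hidden * len(dims), n_classes)

    def forward(self, feats):
        # feats: {modality: tensor of shape (batch, seq_len, dim)}
        # 1) Mine speaker characteristics from the concatenated context sequence.
        context = torch.cat(list(feats.values()), dim=-1)
        speaker_feat, _ = self.speaker_enc(context)               # (B, T, hidden)
        fused, disc_logits = [], []
        for m, x in feats.items():
            # 2) Incorporate speaker characteristics with the unimodal features.
            h = torch.relu(self.proj[m](torch.cat([x, speaker_feat], dim=-1)))
            # 3) Split into a common part (adversarially aligned across modalities)
            #    and a modality-specific part.
            c, p = self.common(h), self.private[m](h)
            disc_logits.append(self.discriminator(GradReverse.apply(c)))
            fused.append(torch.cat([c, p], dim=-1))
        # 4) Fuse all modalities and classify each utterance's emotion.
        logits = self.classifier(torch.cat(fused, dim=-1))
        return logits, disc_logits


# Usage with hypothetical feature dimensions for text, audio, and visual inputs:
model = MALNSketch(dims={"text": 100, "audio": 100, "visual": 100})
feats = {m: torch.randn(2, 10, 100) for m in ("text", "audio", "visual")}
logits, disc_logits = model(feats)   # logits: (2, 10, 6)
```

In this sketch, the emotion classifier would be trained with a standard cross-entropy loss, while the discriminator's cross-entropy over modality labels, combined with gradient reversal, pushes the common representations of different modalities toward indistinguishability; the per-modality private branches retain the diversity. The actual AMDM design in the paper may differ.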