Emotion recognition in conversation (ERC) is essential for developing empathetic conversational systems. In conversation, emotions are expressed across multiple modalities, i.e., audio, text, and visual. Because each modality has distinct inherent characteristics, it is difficult for a model to exploit all of them effectively when fusing modal information. However, existing approaches assume that every modality has the same representation ability, which leads to unsatisfactory cross-modal fusion. Therefore, we treat modalities as having different representation abilities, introduce the concept of the main modality, i.e., the modality with the strongest representation ability after feature extraction, and propose the Main Modal Transformer (MMTr) to improve multimodal fusion. The method preserves the integrity of the main-modality features and enhances the representations of the weaker modalities by using multi-head attention to learn inter-modal information interactions. In addition, we design a new emotional cue extractor that captures emotional cues at two levels (the speaker's self-context and the surrounding conversational context) to enrich the conversational information available to each modality. Extensive experiments on two benchmark datasets validate the effectiveness and superiority of our model.
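
The fusion idea described above can be illustrated with a minimal sketch; this is not the authors' implementation, and the module name, the choice of text as the main modality, and the query/key assignment (weak modality queries the main modality) are all assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class MainModalFusion(nn.Module):
    """Illustrative cross-modal fusion sketch: the main modality's features are
    kept intact and only read as keys/values, while a weaker modality is
    enhanced by attending to them with multi-head attention (hypothetical
    reading of the abstract, not the paper's exact architecture)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Multi-head attention over sequences shaped (batch, time, dim).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, main_feat: torch.Tensor, weak_feat: torch.Tensor) -> torch.Tensor:
        # Weak modality queries the main modality; main features are not modified.
        enhanced, _ = self.attn(query=weak_feat, key=main_feat, value=main_feat)
        # Residual connection preserves the weak modality's original information.
        return self.norm(weak_feat + enhanced)

# Toy usage: assume text is the main modality and audio is a weaker one.
if __name__ == "__main__":
    B, T, D = 2, 10, 256
    text_feat = torch.randn(B, T, D)   # main modality after feature extraction
    audio_feat = torch.randn(B, T, D)  # weaker modality
    fusion = MainModalFusion(dim=D, num_heads=4)
    enhanced_audio = fusion(text_feat, audio_feat)
    # Fused representation: intact main features plus the enhanced weak features.
    fused = torch.cat([text_feat, enhanced_audio], dim=-1)
    print(fused.shape)  # torch.Size([2, 10, 512])
```

In this sketch the main-modality tensor is never overwritten, mirroring the stated goal of preserving its integrity, while the residual connection and attention output together enhance the weaker modality's representation.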