Perceiving human emotions from a multimodal perspective has received significant attention in the knowledge engineering community. Because sequences from different modalities are received at different sampling rates, multimodal streams are inherently asynchronous. Most previous methods perform manual sequence alignment before multimodal fusion, which ignores long-range dependencies across modalities and fails to learn reliable correlations among crossmodal elements. Inspired by the human perception paradigm, we propose a target and source Modality Co-Reinforcement (MCR) approach that achieves sufficient crossmodal interaction and fusion at different granularities. Specifically, MCR introduces two types of target modality reinforcement units to jointly reinforce the multimodal representations. These target units enhance emotion-related knowledge exchange through fine-grained interactions and capture emotionally expressive crossmodal elements through mixed-grained interactions. Moreover, a source modality update module is presented to provide meaningful features for the crossmodal fusion of the target modalities. Through these components, the multimodal representations are progressively reinforced and refined. Comprehensive experiments are conducted on three multimodal emotion understanding benchmarks. Quantitative results show that MCR significantly outperforms previous state-of-the-art methods in both word-aligned and unaligned settings. Additionally, qualitative analysis and visualization further demonstrate the superiority of the proposed modules.
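Since the abstract only sketches the architecture, the snippet below is a minimal, illustrative sketch of the general idea of reinforcing a target modality with a source modality via crossmodal attention over unaligned sequences; it is not the authors' MCR implementation, and the module name `CrossmodalReinforcementUnit` and parameters `d_model`, `n_heads` are hypothetical.

```python
# Illustrative sketch only: a generic crossmodal attention block in which a target
# modality attends to an unaligned source modality. Names are hypothetical and this
# does not reproduce the paper's reinforcement units or source update module.
import torch
import torch.nn as nn


class CrossmodalReinforcementUnit(nn.Module):
    """Reinforce a target sequence with information attended from a source sequence."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Queries come from the target modality; keys/values come from the source.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_t, d_model), source: (batch, T_s, d_model);
        # attention handles unaligned lengths T_t != T_s without manual alignment.
        attended, _ = self.cross_attn(query=target, key=source, value=source)
        target = self.norm1(target + attended)           # residual reinforcement
        target = self.norm2(target + self.ffn(target))   # position-wise refinement
        return target


if __name__ == "__main__":
    text = torch.randn(2, 50, 64)    # e.g. language stream (batch, length, features)
    audio = torch.randn(2, 375, 64)  # e.g. acoustic stream with a different length
    unit = CrossmodalReinforcementUnit()
    reinforced_text = unit(target=text, source=audio)
    print(reinforced_text.shape)     # torch.Size([2, 50, 64])
```

The example only shows why attention-based interaction sidesteps manual word alignment: the target queries every source position directly, so asynchronous streams of different lengths can exchange information end to end.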