Video-based facial expression recognition (FER) has received increasing attention owing to its widespread applications. However, a video often contains many redundant and irrelevant frames, and reducing the redundancy and complexity of the available information while extracting the information most relevant to facial expression from video sequences is a challenging task. In this paper, we divide a video into several short clips for processing and propose a clip-aware emotion-rich feature learning network (CEFLNet) for robust video-based FER. The proposed CEFLNet identifies the emotional intensity expressed in each short clip of a video and obtains clip-aware emotion-rich representations. Specifically, CEFLNet constructs a clip-based feature encoder (CFE) with two-cascaded self-attention and local–global relation learning to encode clip-based spatio-temporal features from the clips of a video. An emotional intensity activation network (EIAN) is then devised to generate emotional activation maps that locate the salient emotion clips and yield clip-aware emotion-rich representations, which are used for expression classification. The effectiveness and robustness of the proposed CEFLNet are evaluated on four public facial expression video datasets: BU-3DFE, MMI, AFEW, and DFEW. Extensive experiments demonstrate the improved performance of CEFLNet in comparison with state-of-the-art methods.
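To make the clip-aware pipeline concrete, the following is a minimal PyTorch sketch of the overall flow (clips, then clip-level encoding, then intensity-based weighting, then classification). It is not the authors' implementation: the module names, feature dimensions, number of attention heads, the softmax-based clip weighting, and the omission of the local–global relation learning stage are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class ClipFeatureEncoder(nn.Module):
    """Stand-in for the CFE: cascaded self-attention over the frame
    features of one clip, followed by temporal average pooling."""

    def __init__(self, feat_dim=512, num_heads=4):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(feat_dim)
        self.norm2 = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats):            # (B, T, D) frame features of one clip
        x, _ = self.attn1(frame_feats, frame_feats, frame_feats)
        x = self.norm1(frame_feats + x)
        y, _ = self.attn2(x, x, x)
        y = self.norm2(x + y)
        return y.mean(dim=1)                   # (B, D) clip-level feature


class EmotionalIntensityActivation(nn.Module):
    """Stand-in for the EIAN: scores each clip's emotional intensity and
    re-weights clip features into a single video-level representation."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, clip_feats):             # (B, C, D) features of C clips
        weights = torch.softmax(self.score(clip_feats), dim=1)   # (B, C, 1)
        video_feat = (weights * clip_feats).sum(dim=1)           # (B, D)
        return video_feat, weights


class CEFLNetSketch(nn.Module):
    """End-to-end sketch: clip features -> intensity-weighted video
    representation -> expression logits."""

    def __init__(self, feat_dim=512, num_classes=7):
        super().__init__()
        self.cfe = ClipFeatureEncoder(feat_dim)
        self.eian = EmotionalIntensityActivation(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):                  # (B, C, T, D) pre-extracted frame features
        b, c, t, d = clips.shape
        clip_feats = self.cfe(clips.reshape(b * c, t, d)).reshape(b, c, d)
        video_feat, weights = self.eian(clip_feats)
        return self.classifier(video_feat), weights


# Example: 2 videos, 4 clips each, 8 frames per clip, 512-dim frame features.
model = CEFLNetSketch()
logits, clip_weights = model(torch.randn(2, 4, 8, 512))
print(logits.shape, clip_weights.shape)        # torch.Size([2, 7]) torch.Size([2, 4, 1])
```

In this sketch the clip weights play the role of the emotional activation maps in spirit only: they indicate which clips contribute most to the final representation, whereas the paper's EIAN produces richer activation maps for locating salient emotion clips.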