Deep learning methods have achieved remarkable success on static datasets for various video tasks. When confronted with continuous data streams, however, these approaches often suffer from catastrophic forgetting: overall performance declines sharply as new classes are learned incrementally. Moreover, existing methods tend to overlook the correlation between the audio and visual modalities in video incremental learning, despite their joint importance for scene comprehension. Continually learning new classes while retaining knowledge of old videos under limited storage and computing resources is thus becoming imperative in multimodal learning. In this paper, we introduce CavRL, a pioneering benchmark for audio–visual representation learning in class-incremental scenarios. To mitigate catastrophic forgetting, we propose a rehearsal-based training approach that leverages a small exemplar set from previous classes. Our approach constrains the memory buffer within strict storage limits, optimizing exemplar selection by learning correlative audio–visual representations. Additionally, we employ a distillation method that mitigates forgetting in a self-supervised manner. Evaluations on two prevalent multimodal tasks, audio–visual event classification and audio–visual speaker recognition, demonstrate that CavRL outperforms existing state-of-the-art incremental learning methods across various settings. We anticipate that CavRL will significantly advance research in continual multimodal learning.
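The two mechanisms named above, a storage-limited exemplar buffer and a distillation term on old-class outputs, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual method: `select_exemplars` uses a generic herding-style criterion over fused audio–visual feature vectors, and `distillation_loss` is standard soft-target distillation; both function names and the temperature `T` are assumptions for illustration.

```python
import numpy as np

def select_exemplars(features, budget):
    """Greedy herding-style selection (hypothetical sketch): pick samples
    whose running mean best approximates the class mean of the fused
    audio-visual features, until the storage budget is reached."""
    class_mean = features.mean(axis=0)
    selected, chosen_sum = [], np.zeros_like(class_mean)
    remaining = list(range(len(features)))
    for k in range(1, budget + 1):
        # Choose the sample that moves the exemplar mean closest to the class mean.
        dists = [np.linalg.norm(class_mean - (chosen_sum + features[i]) / k)
                 for i in remaining]
        best = remaining.pop(int(np.argmin(dists)))
        selected.append(best)
        chosen_sum += features[best]
    return selected

def distillation_loss(old_logits, new_logits, T=2.0):
    """Soft-target distillation between the frozen previous model's outputs
    and the current model's outputs on old classes; no labels are needed,
    which is what makes the constraint self-supervised."""
    def softmax(x):
        e = np.exp(x / T - np.max(x / T, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_old, p_new = softmax(old_logits), softmax(new_logits)
    # Cross-entropy of the new predictions against the old soft targets.
    return float(-(p_old * np.log(p_new + 1e-12)).sum(axis=-1).mean())
```

In a rehearsal loop, `features` would be the concatenated audio and visual embeddings of one old class, `budget` the per-class slice of the memory buffer, and the distillation term would be added to the classification loss on new-class batches.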