Video-based action recognition is an important task in computer vision, aiming to extract rich spatial–temporal information to recognize human actions from videos. Many approaches adopt self-supervised learning on large-scale unlabeled datasets and then transfer the learned representations to the downstream action recognition task. Although much progress has been made in action recognition with video representation learning, two main issues remain for most existing methods. First, pre-training with self-supervised pretext tasks usually yields generic representations that are not particularly informative for the downstream action recognition task. Second, the valuable knowledge learned from large-scale pre-training datasets is gradually forgotten during fine-tuning. To address these issues, in this paper we propose a novel video representation learning method with knowledge-guided pre-training and fine-tuning for action recognition, which incorporates external human parsing knowledge to generate informative representations during pre-training, and preserves the pre-trained knowledge during fine-tuning via self-distillation to avoid catastrophic forgetting. Our model, with contributions from the external human parsing knowledge, video-level contrastive learning, and knowledge-preserving self-distillation, achieves state-of-the-art performance on two popular benchmarks, i.e., UCF101 and HMDB51, verifying the effectiveness of the proposed method.
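The knowledge-preserving self-distillation idea can be illustrated with a minimal PyTorch sketch. All names here (e.g., KnowledgePreservingFineTuner, distill_weight) are hypothetical, and the MSE feature-matching term is just one plausible instantiation of the distillation loss; the paper's exact formulation may differ.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgePreservingFineTuner(nn.Module):
    """Illustrative sketch: fine-tune a pre-trained video encoder while
    distilling from a frozen copy of itself, so that pre-trained
    knowledge is not catastrophically forgotten."""

    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int,
                 num_classes: int, distill_weight: float = 1.0):
        super().__init__()
        self.student = pretrained_encoder                 # updated during fine-tuning
        self.teacher = copy.deepcopy(pretrained_encoder)  # frozen pre-trained copy
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.distill_weight = distill_weight              # hypothetical trade-off weight

    def forward(self, clips: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        student_feat = self.student(clips)
        with torch.no_grad():
            teacher_feat = self.teacher(clips)            # frozen pre-trained features
        logits = self.classifier(student_feat)

        # Supervised action-recognition loss on the downstream labels.
        cls_loss = F.cross_entropy(logits, labels)
        # Self-distillation term: keep student features close to the frozen
        # pre-trained features (one possible choice of distillation loss).
        distill_loss = F.mse_loss(student_feat, teacher_feat)
        return cls_loss + self.distill_weight * distill_loss

if __name__ == "__main__":
    # Toy encoder standing in for a real video backbone.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 512))
    model = KnowledgePreservingFineTuner(encoder, feat_dim=512, num_classes=101)
    clips = torch.randn(4, 3, 8, 32, 32)   # (batch, channels, frames, H, W)
    labels = torch.randint(0, 101, (4,))
    loss = model(clips, labels)
    loss.backward()
```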