No-reference (NR) video quality assessment (VQA) is a challenging problem because insufficient annotated samples make model training difficult. Previous work commonly relies on transfer learning, directly migrating models pre-trained on image databases, which suffers from the domain gap between images and videos. Recently, self-supervised representation learning has attracted considerable attention because it does not depend on large-scale labeled data. However, existing self-supervised representation learning methods only consider the distortion types and contents of videos; the intrinsic properties of videos relevant to the VQA task remain to be investigated. To address this, we propose a novel multi-task self-supervised representation learning framework to pre-train a video quality assessment model. Specifically, we consider the effects of distortion degree, distortion type, and frame rate on the perceived quality of videos, and use them as guidance to generate self-supervised samples and labels. We then optimize the VQA model's ability to capture spatio-temporal differences between the original video and its distorted version through three pretext tasks. The resulting framework not only relaxes the requirement on the quality of the original videos but also benefits from the self-supervised labels as well as the Siamese network. In addition, we propose a Transformer-based VQA model in which short-term spatio-temporal dependencies of videos are modeled by a 3D-CNN and a 2D-CNN, and long-term spatio-temporal dependencies are then modeled by a Transformer, exploiting its strong long-range modeling capability. We evaluated the proposed method on four public video quality assessment databases and found it competitive with all compared VQA algorithms.
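
To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a Transformer-based VQA model in which a 2D-CNN and a 3D-CNN extract short-term features per video chunk and a Transformer encoder models long-term dependencies across chunks; the toy backbones, chunking scheme, pooling, and regression head are illustrative assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class QualityTransformer(nn.Module):
    """Sketch: 2D-CNN + 3D-CNN for short-term features, Transformer for long-term."""
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2, chunk_len=8):
        super().__init__()
        self.chunk_len = chunk_len
        # 2D-CNN branch: per-frame spatial features (toy backbone for illustration)
        self.cnn2d = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # 3D-CNN branch: short-term spatio-temporal features per chunk
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=(2, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64 + 64, feat_dim)
        # Transformer encoder models long-term dependencies across chunk tokens
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)  # regress a scalar quality score

    def forward(self, video):                          # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        n_chunks = t // self.chunk_len
        chunks = video[:, :n_chunks * self.chunk_len]
        chunks = chunks.reshape(b * n_chunks, self.chunk_len, c, h, w)
        # 2D features: average per-frame spatial features within each chunk
        f2d = self.cnn2d(chunks.reshape(-1, c, h, w))
        f2d = f2d.reshape(b * n_chunks, self.chunk_len, -1).mean(dim=1)
        # 3D features: one spatio-temporal feature vector per chunk
        f3d = self.cnn3d(chunks.permute(0, 2, 1, 3, 4))  # (B*N, C, T, H, W)
        tokens = self.proj(torch.cat([f2d, f3d], dim=-1))
        tokens = tokens.reshape(b, n_chunks, -1)          # chunk-level tokens
        tokens = self.transformer(tokens)                 # long-term modeling
        return self.head(tokens.mean(dim=1)).squeeze(-1)  # (B,) quality scores

if __name__ == "__main__":
    model = QualityTransformer()
    scores = model(torch.randn(2, 16, 3, 112, 112))  # 2 clips, 16 frames each
    print(scores.shape)  # torch.Size([2])
```

In the self-supervised pre-training stage described above, such a backbone would be shared between the two branches of a Siamese network that receives an original video and a distorted or frame-rate-altered version, with the three pretext tasks supervising the predicted spatio-temporal differences.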