The quality of videos is the primary concern of video service providers. Video quality assessment (VQA), built upon deep neural networks, has progressed rapidly. Although existing works have introduced knowledge of the human visual system (HVS) into VQA, several limitations still hinder the full exploitation of the HVS, including incomplete modeling that covers only a few HVS characteristics and insufficient connections among those characteristics. In this article, we present a novel spatial-temporal VQA method termed HVS-5M, in which we design five modules to simulate five characteristics of the HVS and create a bioinspired connection among these modules in a cooperative manner. Specifically, in the spatial domain, the visual saliency module first extracts a saliency map. Then, the content-dependency and edge masking modules extract content and edge features, respectively, both of which are weighted by the saliency map to highlight the regions that human beings are likely to attend to. In the temporal domain, the motion perception module extracts dynamic temporal features. In addition, the temporal hysteresis module simulates the memory mechanism of human beings and comprehensively evaluates the video quality according to the fused features from the spatial and temporal domains. Extensive experiments show that HVS-5M outperforms state-of-the-art VQA methods. Ablation studies further verify the contribution of each module to the proposed method. The source code is available at https://github.com/GZHU-DVL/HVS-5M.
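To make the described pipeline concrete, the following is a minimal, schematic PyTorch sketch of how the five modules could be composed: a saliency map weighting the content and edge features, a motion cue from the temporal domain, and a recurrent unit standing in for the temporal hysteresis (memory) stage. The specific layers used here (single convolutions, frame differencing for motion, a GRU for memory) are illustrative assumptions for exposition only and do not reproduce the actual HVS-5M backbones.

```python
# Schematic sketch of a five-module spatial-temporal VQA pipeline.
# All layer choices are illustrative assumptions, not the HVS-5M implementation.
import torch
import torch.nn as nn


class HVS5MSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Spatial side: visual saliency, content-dependency, and edge masking.
        self.saliency = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
        self.content = nn.Conv2d(3, feat_dim, 3, stride=2, padding=1)
        self.edge = nn.Conv2d(3, feat_dim, 3, stride=2, padding=1)
        # Temporal side: motion perception from frame differences.
        self.motion = nn.Conv2d(3, feat_dim, 3, stride=2, padding=1)
        # Temporal hysteresis: a recurrent unit acting as a simple memory.
        self.memory = nn.GRU(3 * feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        per_frame = []
        prev = frames[:, 0]
        for i in range(t):
            x = frames[:, i]
            sal = self.saliency(x)            # visual saliency map
            content = self.content(x * sal)   # saliency-weighted content features
            edge = self.edge(x * sal)         # saliency-weighted edge features
            motion = self.motion(x - prev)    # crude motion cue from frame difference
            prev = x
            feats = torch.cat(
                [f.mean(dim=(2, 3)) for f in (content, edge, motion)], dim=1
            )
            per_frame.append(feats)
        seq = torch.stack(per_frame, dim=1)   # (batch, time, 3 * feat_dim)
        hidden, _ = self.memory(seq)          # temporal hysteresis / memory
        return self.head(hidden[:, -1]).squeeze(-1)  # predicted quality score


# Usage: score a batch of two 8-frame clips.
model = HVS5MSketch()
clips = torch.rand(2, 8, 3, 64, 64)
print(model(clips).shape)  # torch.Size([2])
```

The sketch only illustrates the cooperative connection among the modules, i.e., the saliency map modulating the spatial features before they are fused with the temporal features and aggregated by a memory stage; for the actual module designs, see the source code linked above.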