The challenge of video quality assessment (VQA) modeling for user-generated content (UGC), i.e., UGC-VQA, lies in accurately extracting discriminative features and precisely quantifying interfeature interactions by following the behavioral patterns of human eye-brain visual perception. To address this issue, we propose the Deeper Spatial–Temporal Scoring Network (DSTS-Net) for precise VQA. Concretely, we first deploy a multiscale feature extraction module to characterize content-aware features in accordance with the nonlinear reverse hierarchy theory of video perception, which is not fully considered in existing UGC-VQA models. Hierarchical handcrafted and semantic features are considered simultaneously through content-adaptive weighting. Second, we develop a feature integration structure, the deeper gated recurrent unit (DGRU), to fully imitate the interfeature interactions of visual perception, including its feedforward and feedback processes. Third, a dual DGRU structure is employed to further account for interframe interactions of hierarchical features, imitating the nonlinearity of perception as much as possible. Finally, a local adaptive smoothing module improves temporal pooling by accounting for the temporal hysteresis effect. Holistic validation of the proposed method on four challenging public UGC-VQA datasets shows performance comparable to that of state-of-the-art no-reference VQA methods; in particular, our method accurately predicts the quality of low-quality videos with weak temporal correlation. To promote reproducible research and public evaluation, an implementation of our method is available online: https://github.com/liu0527aa/DSTS-Net.
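To make the overall pipeline concrete, the following is a minimal sketch of the abstract's processing order (per-frame features, two stacked recurrent stages standing in for the dual DGRU, and locally smoothed temporal pooling). It is not the authors' implementation: the DSTS-Net internals, feature dimensions, and the smoothing window are all assumptions made for illustration; see the linked repository for the actual code.

```python
# Illustrative sketch only: module names, dimensions, and the smoothing
# window below are assumptions, not the released DSTS-Net implementation.
import torch
import torch.nn as nn


class DualGRUPipeline(nn.Module):
    """Hypothetical per-video regressor: frame features -> two stacked GRUs
    (a stand-in for the dual DGRU) -> per-frame scores -> smoothed pooling."""

    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.gru1 = nn.GRU(feat_dim, hidden_dim, batch_first=True)    # feature integration stage (assumed)
        self.gru2 = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # interframe interaction stage (assumed)
        self.head = nn.Linear(hidden_dim, 1)                          # per-frame quality score

    def forward(self, frame_feats):                  # frame_feats: (B, T, feat_dim)
        h, _ = self.gru1(frame_feats)
        h, _ = self.gru2(h)
        frame_scores = self.head(h).squeeze(-1)      # (B, T)
        # Local averaging as a crude proxy for hysteresis-aware temporal pooling:
        # each frame score is smoothed over a 5-frame neighborhood before pooling.
        kernel = torch.ones(1, 1, 5, device=frame_scores.device) / 5.0
        smoothed = nn.functional.conv1d(frame_scores.unsqueeze(1), kernel, padding=2)
        return smoothed.mean(dim=(1, 2))             # one predicted score per video


if __name__ == "__main__":
    feats = torch.randn(2, 30, 2048)                 # 2 videos, 30 frames, assumed 2048-D features
    print(DualGRUPipeline()(feats).shape)            # torch.Size([2])
```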