Video super-resolution (VSR) aims to recover high-resolution frames from multiple consecutive low-resolution frames. However, existing VSR methods treat videos only as image sequences and ignore another essential source of temporal information: audio. In fact, there is a semantic link between audio and vision, and extensive studies have shown that audio can provide supervisory information for visual networks. Meanwhile, adding semantic priors has proven effective in super-resolution (SR) tasks, but obtaining semantic segmentation maps requires a pretrained segmentation network. By contrast, audio is contained in the video itself and can be used directly. Therefore, in this study, we propose a novel, pluggable multiscale audiovisual fusion (MS-AVF) module that enhances VSR performance by exploiting the relevant audio information, which can be regarded as implicit semantic guidance, in contrast to explicit segmentation priors. Specifically, we first fuse audiovisual features on semantic feature maps of the target frames at different granularities; then, through a top-down multiscale fusion approach, we feed high-level semantics back to the underlying global visual features layer by layer, thereby providing effective implicit semantic guidance from audio for VSR. Experimental results show that audio can further improve VSR performance. Moreover, visualizations of the learned attention mask show that the proposed end-to-end model automatically learns potential audiovisual semantic links, in particular improving SR accuracy for sound sources and their surrounding regions.
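To make the two-step idea above concrete (per-scale audiovisual fusion followed by top-down feedback), the following is a minimal illustrative sketch only, not the MS-AVF module itself. It assumes single-channel feature maps stored as nested lists, a scalar audio score per scale, a sigmoid gate as the fusion rule, and nearest-neighbour upsampling with additive merging; the names `fuse_audio`, `upsample2x`, and `top_down_fusion` are hypothetical and none of these choices is taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_audio(feat, audio_score):
    """Gate a 2D visual map with a scalar audio score.

    Assumed fusion rule (hypothetical): each value is scaled by an
    attention weight sigmoid(value * audio_score).
    """
    return [[v * sigmoid(v * audio_score) for v in row] for row in feat]


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 2D nested list."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def top_down_fusion(pyramid, audio_scores):
    """Fuse audio at every scale, then propagate coarse maps to fine ones.

    pyramid: list of 2D maps ordered fine to coarse, each half the height
    and width of the previous one.
    audio_scores: one scalar per scale (assumed audio representation).
    """
    fused = [fuse_audio(f, a) for f, a in zip(pyramid, audio_scores)]
    out = fused[-1]  # start from the coarsest (most semantic) level
    for finer in reversed(fused[:-1]):
        up = upsample2x(out)
        # additive merge of upsampled coarse semantics into the finer map
        out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(finer, up)]
    return out
```

Calling `top_down_fusion` on a fine-to-coarse pyramid of 4x4, 2x2, and 1x1 maps returns a 4x4 map in which every position has received contributions from all coarser levels. A real module would operate on multichannel tensors with learned fusion weights.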