With the rapid development of the Internet and multimedia technologies, multimedia applications that integrate audio and video are becoming increasingly prevalent in both everyday life and professional environments. A critical challenge is to substantially improve compression efficiency and bandwidth utilization while maintaining a high-quality user experience. To address this challenge, Just Noticeable Distortion (JND) estimation models, which exploit the perceptual characteristics of the Human Visual System (HVS), are widely used in image and video coding to improve data compression. However, human perception is an integrative process that involves both visual and auditory stimuli. This paper therefore investigates the influence of audio signals on visual perception and presents a collaborative audio–video JND estimation model tailored for multimedia applications. Specifically, we characterize audio loudness, duration, and energy as temporal perceptual features, and take the audio saliency projected onto the image plane as the spatial perceptual feature. An audio JND adjustment factor is then constructed from these features using a piecewise function. Finally, the proposed model combines a video-based JND model with the audio JND adjustment factor to form the collaborative audio–video JND estimation model. Compared with existing JND models, the proposed model achieves the best subjective quality, with an average PSNR of 26.97 dB. The experimental results confirm that audio significantly affects human visual perception, and that the proposed audio–video collaborative JND model improves the accuracy of JND estimation for multimedia data, thereby enhancing compression efficiency while maintaining a high-quality user experience.