Abstract

Visually guided spatial sound generation (VGSSG) is a multimodal learning approach well suited to recorded videos. However, existing methods are difficult to apply directly to spatial sound generation for movie clips, mainly because (1) movies employ Cinematic Audiovisual Language (CAL), which makes it hard to construct spatial sound mapping models through purely data-driven methods, and (2) model performance suffers from the large heterogeneous gap between the audio and visual modalities. To address these problems, we propose a VGSSG method based on CAL decision-making and hierarchical feature encoding and decoding, which effectively generates spatial sound in accordance with the CAL of movies. Specifically, to model CAL, we establish a multimodal information-guided movie audio rendering decision maker that selects the rendering strategy according to the CAL of the current clip. To narrow the heterogeneous gap that hinders the fusion of audio and visual data, we propose an encoder-decoder structure based on hierarchical fusion of audiovisual features and full-scale skip connections, which improves the efficiency of jointly exploiting the two modalities and demonstrates the effectiveness of shallow features for the VGSSG task. We integrate both 2-channel and 6-channel spatial audio generation into a unified framework. In addition, we establish a movie audiovisual bimodal dataset with hand-crafted CAL annotations. Experiments show that, compared with existing methods, our method achieves lower generation distortion.
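To make the architectural idea in the abstract concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of an encoder-decoder in which a visual embedding is fused into the audio features at several depths ("hierarchical fusion") and every encoder scale is resampled and concatenated into the decoder ("full-scale skip connections"). All module names, channel sizes, and the mono-spectrogram input/output shapes are illustrative assumptions; only the output channel count (2 or 6) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Fuse a clip-level visual embedding into an audio feature map at one scale."""
    def __init__(self, ch, vis_dim=512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, ch)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, a, v):
        # a: (B, ch, F, T) audio features; v: (B, vis_dim) visual feature
        bias = self.proj(v).unsqueeze(-1).unsqueeze(-1)        # (B, ch, 1, 1)
        return F.relu(self.conv(a + bias))

class FullScaleDecoder(nn.Module):
    """One decoder stage that gathers ALL encoder scales (full-scale skips)."""
    def __init__(self, enc_chs, out_ch):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in enc_chs])
        self.merge = nn.Conv2d(out_ch * len(enc_chs), out_ch, 3, padding=1)

    def forward(self, enc_feats, size):
        # Resample every encoder scale to the target size, then merge them.
        skips = [F.interpolate(r(f), size=size, mode="nearest")
                 for r, f in zip(self.reduce, enc_feats)]
        return F.relu(self.merge(torch.cat(skips, dim=1)))

class SpatialAudioNet(nn.Module):
    def __init__(self, out_channels=2, vis_dim=512):
        super().__init__()
        chs = [32, 64, 128]
        self.enc, self.fuse = nn.ModuleList(), nn.ModuleList()
        in_ch = 1                                              # mono spectrogram in
        for c in chs:
            self.enc.append(nn.Conv2d(in_ch, c, 3, stride=2, padding=1))
            self.fuse.append(FusionBlock(c, vis_dim))
            in_ch = c
        self.dec = FullScaleDecoder(chs, 64)
        self.head = nn.Conv2d(64, out_channels, 1)             # 2- or 6-channel output

    def forward(self, mono_spec, vis_feat):
        feats, x = [], mono_spec
        for conv, fuse in zip(self.enc, self.fuse):
            x = fuse(F.relu(conv(x)), vis_feat)                # hierarchical fusion
            feats.append(x)
        d = self.dec(feats, size=mono_spec.shape[-2:])
        return self.head(d)                                    # predicted spatial channels

if __name__ == "__main__":
    net = SpatialAudioNet(out_channels=6)
    spec = torch.randn(2, 1, 128, 64)       # (batch, 1, freq, time) mono spectrogram
    vis = torch.randn(2, 512)                # clip-level visual embedding
    print(net(spec, vis).shape)              # torch.Size([2, 6, 128, 64])
```

Switching `out_channels` between 2 and 6 is how such a sketch would cover both stereo and 5.1-style outputs in a single framework; the paper's actual CAL-driven rendering decision maker, which selects the rendering strategy per clip, is not represented here.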
