Abstract

In heterogeneous networks, different modalities are coexisting. For example, video sources with certain lengths usually have abundant time-varying audiovisual data. From the users’ perspective, different video segments will trigger different kinds of emotions. In order to better interact with users in heterogeneous networks and improve their user experiences, affective video content analysis to predict users’ emotions is essential. Academically, users’ emotions can be evaluated by arousal and valence values, and fear degree, which provides an approach to quantize the prediction accuracy of the reaction of the audience and users towards videos. In this paper, we propose the multimodal data fusion method for integrating the visual and audio data in order to perform the affective video content analysis. Specifically, to align the visual and audio data, the temporal attention filters are proposed to obtain the time-span features of the entire video segments. Then, by using the two-branch network structure, matched visual and audio features are integrated in the common space. At last, the fused audiovisual feature is employed for the regression and classification subtasks in order to measure the emotional responses of users. Simulation results show that the proposed method can accurately predict the subjective feelings of users towards the video contents, which provides a way to predict users’ preferences and recommend videos according to their own demand.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call