Abstract

Short video is one of the most popular forms of user-generated content, and it also serves as a carrier of people's emotions. However, research on the emotional consistency between audio and video is limited, and relevant datasets are also lacking. In this paper, we propose a multi-modal fusion system for assessing the emotional consistency between different types of action videos and audios carrying different emotions. We also build a new dataset and compare early fusion and late fusion methods on it. We use video features extracted by a pre-trained C3D network and audio features extracted by Librosa, a tool for audio analysis. In the early fusion method, we concatenate the video and audio features and train an SVM with a linear kernel on the fused features. In the late fusion method, the video and audio features are used to train separate classifiers, each producing its own decision; the two decisions are then fused to obtain the final classification result. Our best classifier attained an accuracy of 85.56%.
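The sketch below illustrates the two fusion strategies described above, under stated assumptions: MFCC statistics stand in for the Librosa audio features, the C3D embedding extractor is left as a placeholder, and the weighted-sum decision fusion and its weight are illustrative choices rather than the authors' exact pipeline.

```python
import numpy as np
import librosa
from sklearn.svm import SVC


def audio_features(wav_path):
    # Assumed feature set: mean and std of MFCCs computed with Librosa.
    y, sr = librosa.load(wav_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


# video_features(clip) is assumed to return a fixed-length embedding
# taken from a pre-trained C3D network (e.g. an fc-layer activation).


def early_fusion_train(video_feats, audio_feats, labels):
    # Early fusion: concatenate modality features, then train one linear SVM.
    fused = np.hstack([video_feats, audio_feats])
    return SVC(kernel="linear").fit(fused, labels)


def late_fusion_train(video_feats, audio_feats, labels):
    # Late fusion: train one classifier per modality; decisions are combined later.
    v_clf = SVC(kernel="linear").fit(video_feats, labels)
    a_clf = SVC(kernel="linear").fit(audio_feats, labels)
    return v_clf, a_clf


def late_fusion_predict(v_clf, a_clf, video_feats, audio_feats, w=0.5):
    # Fuse per-modality decision scores with an (assumed) weighted sum,
    # then threshold for a binary consistent / inconsistent label.
    score = (w * v_clf.decision_function(video_feats)
             + (1 - w) * a_clf.decision_function(audio_feats))
    return (score > 0).astype(int)
```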
