Abstract
This paper explores automatic genre labelling of movie trailers using their audio-visual features, with a focus on fusion techniques (early fusion and late fusion) and the resulting improvement in classification accuracy. It proposes a novel combination of deep learned features (from a pretrained VGG-16 model), obtained using a state-of-the-art shot detector, and hand-crafted audio features. Neither this combination of features nor a comparison of early and late fusion on them has previously been attempted in the literature. Two popular fusion techniques and three distinct classification algorithms are investigated to determine the best combination of fusion technique and classifier. The study uses a subset of the LMTD-9 movie trailer dataset restricted to four genres (action, comedy, drama and horror). The best performing low-level audio features are timbre features extracted using the MIRtoolbox, followed by standalone mel-frequency cepstral coefficients. The best performing high-level audio feature is tonality. The audio features are augmented by visual features extracted using the pretrained convolutional neural network (VGG-16). Early and late feature fusion are evaluated with three classifiers: extreme gradient boosting, a support vector machine and a neural network. Precision, recall, F1 score and confusion matrices are used to measure classification performance. Early fusion outperforms late fusion, with a classification performance gain of approximately 10% on the four-class problem. With early fusion, the best performance is obtained with the support vector machine (73.12% accuracy), followed by the extreme gradient boosting classifier (69.37% accuracy) and the neural network classifier (67.50% accuracy); chance level is 25%.
It is shown that superior classification performance can be achieved by employing early feature fusion of low-level audio descriptors, high-level audio descriptors and high-level visual feature descriptors together with suitable classifiers.
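To make the distinction between the two fusion strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes that early fusion means concatenating the per-trailer audio and visual feature vectors before a single classifier is trained, and that late fusion means combining the class probabilities produced by separate per-modality classifiers, here with a weighted average. The function names, the `audio_weight` parameter and the example values are hypothetical.

```python
from typing import List, Sequence

# The four genres used in the study's LMTD-9 subset.
GENRES = ["action", "comedy", "drama", "horror"]


def early_fusion(audio: Sequence[float], visual: Sequence[float]) -> List[float]:
    """Early fusion (assumed form): concatenate the two feature vectors.

    The fused vector would then be passed to a single classifier.
    """
    return list(audio) + list(visual)


def late_fusion(
    audio_probs: Sequence[float],
    visual_probs: Sequence[float],
    audio_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> List[float]:
    """Late fusion (assumed form): weighted average of per-class probabilities.

    Each input is the output of a separate, modality-specific classifier.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("both classifiers must score the same classes")
    w = audio_weight
    return [w * a + (1.0 - w) * v for a, v in zip(audio_probs, visual_probs)]


def predict_genre(probs: Sequence[float]) -> str:
    """Return the genre with the highest fused probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENRES[best]


# Illustrative values only; these are not features or scores from the study.
fused_features = early_fusion([0.1, 0.2], [0.3])
fused_probs = late_fusion([0.6, 0.2, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1])
genre = predict_genre(fused_probs)
```

In this sketch the early-fusion path lets one classifier see both modalities at once, whereas the late-fusion path only combines decisions that were already made per modality.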