Abstract

The recent development of deep learning techniques, such as convolutional networks, has shed new light on the video (story) scene segmentation problem, bringing the potential to outperform state-of-the-art non-deep-learning multimodal approaches. However, one important aspect of multimodality still needs investigation in the context of deep learning: multimodal fusion. Features are often fed directly to a network, which may be an inadequate way to perform the underlying multimodal fusion. This paper presents an evaluation of early and late approaches to deep learning multimodal fusion. In addition, it proposes a new deep learning model for video scene segmentation, based on the feature extraction capabilities of convolutional networks and a recurrent neural network architecture. The results show that the early versus late fusion discussion is reopened with respect to deep learning. Moreover, the results show that the proposed model is competitive with state-of-the-art techniques when evaluated on a public documentary video dataset, obtaining up to 64% average FCO, while also maintaining a lower computational cost than a related convolutional approach.
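To make the early versus late fusion contrast concrete, the sketch below shows both strategies in PyTorch for sequences of per-shot features feeding a recurrent model. It is a minimal illustration only: the class names, feature dimensions, LSTM choice, and score averaging are assumptions made for this sketch, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class EarlyFusionSegmenter(nn.Module):
        """Early fusion: concatenate per-shot visual and audio features,
        then run a single recurrent network over the fused sequence."""
        def __init__(self, vis_dim=2048, aud_dim=128, hidden=256):
            super().__init__()
            self.rnn = nn.LSTM(vis_dim + aud_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)  # scene-boundary score per shot

        def forward(self, vis, aud):
            # vis: (batch, shots, vis_dim), aud: (batch, shots, aud_dim)
            fused = torch.cat([vis, aud], dim=-1)  # fuse before the network
            out, _ = self.rnn(fused)
            return self.head(out).squeeze(-1)      # (batch, shots)

    class LateFusionSegmenter(nn.Module):
        """Late fusion: one recurrent network per modality; the
        per-modality boundary scores are merged only at the end."""
        def __init__(self, vis_dim=2048, aud_dim=128, hidden=256):
            super().__init__()
            self.vis_rnn = nn.LSTM(vis_dim, hidden, batch_first=True)
            self.aud_rnn = nn.LSTM(aud_dim, hidden, batch_first=True)
            self.vis_head = nn.Linear(hidden, 1)
            self.aud_head = nn.Linear(hidden, 1)

        def forward(self, vis, aud):
            v, _ = self.vis_rnn(vis)
            a, _ = self.aud_rnn(aud)
            scores = self.vis_head(v) + self.aud_head(a)  # decision-level merge
            return scores.squeeze(-1) / 2

In both cases the inputs would be per-shot feature vectors, for example produced by a pretrained convolutional network, and a sigmoid over the output scores yields per-shot boundary probabilities. The early variant lets the network model cross-modal interactions throughout, while the late variant keeps the modalities independent until the decision level.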
