Abstract

Multimodal fusion is a complex topic. For surveillance applications audio–visual fusion is very promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work we have analysed a database with audio–visual recordings of unwanted behavior in trains (Lefter et al., 2012) and focused on a limited subset of the recorded data. We have collected multi- and unimodal assessments by humans, who have given aggression scores on a 3 point scale. We showed that there are no trivial fusion algorithms to predict the multimodal labels from the unimodal labels since part of the information is lost when using the unimodal streams. We proposed an intermediate step to discover the structure in the fusion process. This step is based upon meta-features and we find a set of five which have an impact on the fusion process. In this paper we extend the findings in (Lefter et al., 2012) for the general case using the entire database. We prove that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features. They are based on automatic prediction of the intermediate level variables and multimodal aggression from state of the art low level acoustic, linguistic and visual features. The first fusion method is based on applying multiple classifiers to predict intermediate level features from the low level features, and to predict the multimodal label from the intermediate variables. The other two approaches are based on probabilistic graphical models, one using (Dynamic) Bayesian Networks and the other one using Conditional Random Fields. We learn that each approach has its strengths and weaknesses in predicting specific aggression classes and using the meta-features yields significant improvements in all cases.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.