Title: A comparative study on automatic audio-visual fusion for aggression detection using meta-information
Author: Lefter, I.; Rothkrantz, L.J.M.; Burghouts, G.J.
Publication year: 2013
Abstract: Multimodal fusion is a complex topic. For surveillance applications audio-visual fusion is very promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work we have analysed a database with audio-visual recordings of unwanted behavior in trains (Lefter et al., 2012) and focused on a limited subset of the recorded data. We have collected multi- and unimodal assessments by humans, who have given aggression scores on a 3-point scale. We showed that there are no trivial fusion algorithms to predict the multimodal labels from the unimodal labels since part of the information is lost when using the unimodal streams. We proposed an intermediate step to discover the structure in the fusion process. This step is based upon meta-features and we find a set of five which have an impact on the fusion process. In this paper we extend the findings in (Lefter et al., 2012) for the general case using the entire database. We prove that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features. They are based on automatic prediction of the intermediate level variables and multimodal aggression from state of the art low level acoustic, linguistic and visual features. The first fusion method is based on applying multiple classifiers to predict intermediate level features from the low level features, and to predict the multimodal label from the intermediate variables.
The other two approaches are based on probabilistic graphical models, one using (Dynamic) Bayesian Networks and the other one using Conditional Random Fields. We learn that each approach has its strengths and weaknesses in predicting specific aggression classes, and using the meta-features yields significant improvements in all cases. © 2013 Elsevier B.V. All rights reserved.
Subject: Physics & Electronics; II - Intelligent Imaging; TS - Technical Sciences; Safety and Security; Defence; Defence, Safety and Security; Audio-visual fusion; Automatic surveillance; Context-based fusion; Meta-information
To reference this document use: http://resolver.tudelft.nl/uuid:c9908de6-f37d-402f-a08a-195928603cf6
DOI: https://doi.org/10.1016/j.patrec.2013.01.002
TNO identifier: 481455
Source: Pattern Recognition Letters, 34 (15), 1953-1963
Document type: article
Files: To receive the publication files, please send an e-mail request to TNO Library.