Aggression detection in speech using sensor and semantic information
conference paper
By analyzing a multimodal (audio-visual) database of aggressive incidents in trains, we observed that no trivial fusion algorithm successfully predicts multimodal aggression from unimodal sensor inputs. We proposed a fusion framework that contains a set of intermediate-level variables (meta-features) between the low-level sensor features and the multimodal aggression detection [1]. In this paper we predict the multimodal level of aggression and two of the meta-features: Context and Semantics. We do this based on the audio stream, from which we extract both acoustic (nonverbal) and linguistic (verbal) information. Given the spontaneous nature of the speech in the database, we rely on a keyword spotting approach for the verbal information. We found six semantic groups of keywords that have a positive influence on the prediction of aggression and of the two meta-features. © 2012 Springer-Verlag.
TNO Identifier
463900
Publisher
Springer
Source title
15th International Conference on Text, Speech and Dialogue, TSD 2012, 3-7 September 2012, Brno, Czech Republic
Place of publication
Berlin : [etc]
Pages
665-672