Automated textual descriptions for a wide range of video events with 48 human actions

Book part
A hybrid method is presented for generating textual descriptions of video based on actions. The method combines an action classifier with a description generator. The aim of the action classifier is to detect and classify the actions in the video so that they can serve as verbs for the description generator. The aim of the description generator is (1) to find the actors (objects or persons) in the video and connect them correctly to the verbs, so that they fill the roles of subject and of direct and indirect objects, and (2) to generate a sentence from the verb, subject, and direct and indirect objects. The novelty of our method is that we combine the discriminative power of a bag-of-features action detector with the generative power of a rule-based action descriptor. We show that this approach outperforms a homogeneous setup that uses the rule-based method for both action detection and description.
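The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the dummy classifier scores, and the role-mapping rules are all assumptions standing in for the actual bag-of-features detector and rule-based generator.

```python
def classify_action(video_features):
    # Stand-in for the bag-of-features action classifier: return the
    # action (verb) whose score is highest for this clip. The scores
    # here are hard-coded dummies, not real detector output.
    scores = {"throw": 0.7, "catch": 0.2, "walk": 0.1}
    return max(scores, key=scores.get)

def generate_description(verb, actors):
    # Stand-in for the rule-based description generator: map detected
    # actors onto grammatical roles (subject, direct object, indirect
    # object) and assemble a simple sentence around the verb.
    subject = actors.get("subject", "someone")
    parts = [subject, verb + "s"]
    if actors.get("direct_object"):
        parts.append(actors["direct_object"])
    if actors.get("indirect_object"):
        parts.append("to " + actors["indirect_object"])
    return " ".join(parts) + "."

def describe(video_features, actors):
    # Hybrid pipeline: discriminative classifier supplies the verb,
    # generative rule-based module produces the sentence.
    verb = classify_action(video_features)
    return generate_description(verb, actors)

print(describe(None, {"subject": "the man",
                      "direct_object": "a ball",
                      "indirect_object": "the woman"}))
# → the man throws a ball to the woman.
```

The key design point carried over from the abstract is the split of responsibilities: the classifier only chooses among a fixed verb vocabulary, while all sentence structure comes from rules over the detected actors.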
TNO Identifier
463593
Publisher
Springer-Verlag
Source title
Computer Vision – ECCV 2012. Workshops and Demonstrations - European Conference on Computer Vision, 7-13 October 2012, Florence, Italy, Part I
Editor(s)
Fusiello, A.
Murino, V.
Cucchiara, R.
Place of publication
Berlin [etc.]
Pages
372-380