Learning the Fusion of Audio and Video Aggression Assessment by Meta-Information from Human Annotations