In this paper, we present a novel model that improves violence detection by extending the dual-modality TEVAD model, which originally leverages visual and textual information, into a multi-modal framework integrating visual, audio, and textual data. We also refine the multi-scale temporal network (MTN) to better capture temporal dependencies between video snippets across multiple temporal scales. Using the XD-Violence dataset, which provides audio tracks for violence detection, we conduct experiments evaluating various feature fusion methods. The proposed model achieves an average precision (AP) of 83.9%, surpassing single-modality approaches (visual: 73.9%, audio: 67.1%, textual: 29.9%) and dual-modality approaches (visual + audio: 78.8%, visual + textual: 78.5%). These results show that the proposed model outperforms models built on the original MTN and reaffirm the efficacy of multi-modal approaches over single- and dual-modality methods for violence detection.
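As a rough illustration of the kind of multi-modal feature fusion the abstract describes, the sketch below concatenates snippet-level visual, audio, and textual features and projects them to a joint representation. The module name, feature dimensions, and projection layer are hypothetical assumptions for the example, not the paper's actual architecture or fusion method.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Hypothetical concatenation-based fusion of visual, audio, and text
    snippet features; illustrative only, not the paper's exact design."""
    def __init__(self, vis_dim=1024, aud_dim=128, txt_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim + aud_dim + txt_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, vis, aud, txt):
        # vis: (B, T, vis_dim), aud: (B, T, aud_dim), txt: (B, T, txt_dim)
        fused = torch.cat([vis, aud, txt], dim=-1)  # concatenate along the feature dim
        return self.proj(fused)                     # (B, T, out_dim)

# Example usage with random snippet-level features (B=2 videos, T=32 snippets)
vis = torch.randn(2, 32, 1024)
aud = torch.randn(2, 32, 128)
txt = torch.randn(2, 32, 768)
fused = ConcatFusion()(vis, aud, txt)
print(fused.shape)  # torch.Size([2, 32, 512])
```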