Multilevel fusion of multimodal deep features for porn streamer recognition in live video

Liyuan Wang, Jimiao Tian, Meng Wang, Li Zhuo, Jing Zhang

doi:10.1016/j.patrec.2020.09.027

Abstract

Live video hosted by streamers attracts a growing number of Internet users. Some streamers mix pornographic content into their live video for profit and popularity, which seriously harms the online environment. To identify porn streamers effectively, this paper proposes a multilevel fusion method of multimodal deep features for porn streamer recognition in live video. (1) A multimodal deep network extracts visual and audio features from the live video, covering spatial, audio, motion, and temporal context. (2) The visual and audio features are fused at the feature level by a multimodal attention mechanism to obtain audio-visual attention features. (3) Text is collected from the viewers' bullet screen comments, and text features are extracted by a bullet screen text network based on the BERT (bidirectional encoder representations from transformers) model. (4) The prediction results of the audio-visual deep network and the bullet screen text network are fused at the decision level to improve porn streamer recognition accuracy. We build a real-world dataset of porn streamers, and experiments on it demonstrate that our method improves porn streamer recognition accuracy.
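The two fusion levels named in steps (2) and (4) can be illustrated with a minimal sketch. The abstract does not specify the attention formulation or the decision-level fusion rule, so everything here is an assumption made for illustration: the function names `attention_fuse` and `decision_fuse`, the softmax weighting over per-modality scores, the equal feature dimensions, and the convex combination with weight `alpha` are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, scores):
    """Feature-level fusion (assumed form): weight each modality's feature
    vector by a softmax attention weight and sum the weighted vectors.

    features: list of equal-length feature vectors, one per modality
    scores:   one unnormalized attention score per modality
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per modality")
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def decision_fuse(p_audio_visual, p_text, alpha=0.5):
    """Decision-level fusion (assumed form): convex combination of the two
    networks' predicted probabilities for the 'porn streamer' class."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_audio_visual + (1.0 - alpha) * p_text
```

For example, `decision_fuse(p_av, p_text, alpha=0.7)` would trust the audio-visual prediction more than the bullet screen text prediction. In practice such a weight would be tuned on held-out data, and the paper may use a different rule entirely.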

Full Text