Multimodal prediction of profanity based on speech analysis

Ivan Smirnov, Anastasia Laushkina

doi:10.1016/j.procs.2023.12.008

Abstract

As multimedia content and online social activity grow, so do moderation problems. Various approaches to moderation and its automation exist, but they are limited when used in real time. Our analysis of the literature showed that most common approaches solve a detection task rather than a prediction task, because they consider the completed utterance. As a result, calls are unprotected against toxic language, and online broadcasts can be unpredictable.

In this work, we suggest a new approach to automatic speech moderation based on dynamic word prediction. The task involves analysis of the auditory and textual channels of speech. Words can have different meanings depending on the context, so we restrict the problem to profanity, which is socially unacceptable regardless of the context.

We propose approaches for working with a speech stream in the profanity prediction task. Using audio features may allow lower latency. We also suggest a pipeline for real-time multimodal prediction (able to predict a sequence whose duration exceeds the processing latency), which compensates for the latency of ASR systems. We compared different solutions for the next color prediction task for English speech and reached an F1 score of 86.6 for 3-class prediction.

Full Text