Abstract

Fake news, hate speech, crude language, and ethnic and racial slurs spread widely every day, yet Sri Lanka has no definitive solution for protecting society from such profanity. The method we propose detects racist, sexist, and cursing content in the Sinhala, Tamil, and English languages. To selectively filter potentially objectionable audio content, the input audio is first preprocessed and converted into text, and objectionable content is then detected with a machine learning filtering mechanism. A preliminary filtering model takes the converted sentences as input and determines through binary classification whether each one is offensive. When a sentence is classified as offensive, secondary filtering is carried out with a separate multi-class text classification model, which classifies each word in the sentence as sexist, racist, cursing, or non-offensive. The preliminary filtering models use a Term Frequency–Inverse Document Frequency (TF-IDF) vectorizer with a Support Vector Machine (SVM) algorithm under varying hyperparameters. For multi-class classification, Logistic Regression (LR) combined with a count vectorizer was used for Sinhala, while Multinomial Naive Bayes with a TF-IDF vectorizer was found suitable for Tamil. For English, LR with a count vectorizer was chosen. The system achieves detection accuracies of 89% for Sinhala and 77% for Tamil. Finally, the detected objectionable content is replaced in the audio with a predetermined audio input.
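
The two-stage filtering described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: the function `filter_sentence`, the `[BEEP]` placeholder, and the toy lexicon-based classifiers are all hypothetical stand-ins. In the actual system the first stage would be the trained TF-IDF and SVM model, the second stage the per-language word classifier, and the replacement would be applied to the audio rather than to text.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: any trained model can be wrapped to match them.
SentenceClassifier = Callable[[str], bool]   # True if the sentence is offensive
WordClassifier = Callable[[str], str]        # returns one of the four categories

NON_OFFENSIVE = "non-offensive"


def filter_sentence(
    sentence: str,
    is_offensive: SentenceClassifier,
    categorize_word: WordClassifier,
    placeholder: str = "[BEEP]",  # stands in for the predetermined audio clip
) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the filtered sentence and a (word, category) label per word."""
    words = sentence.split()

    # Stage 1: binary gate. Sentences judged non-offensive pass through as-is.
    if not is_offensive(sentence):
        return sentence, [(w, NON_OFFENSIVE) for w in words]

    # Stage 2: classify each word, then replace every flagged word.
    labels = [(w, categorize_word(w)) for w in words]
    cleaned = " ".join(
        w if label == NON_OFFENSIVE else placeholder for w, label in labels
    )
    return cleaned, labels


# Toy stand-ins (assumptions for the demo only, not trained models).
TOY_LEXICON = {"curseword": "cursing", "racistword": "racist", "sexistword": "sexist"}


def toy_word_classifier(word: str) -> str:
    return TOY_LEXICON.get(word.lower().strip(".,!?"), NON_OFFENSIVE)


def toy_sentence_classifier(sentence: str) -> bool:
    return any(toy_word_classifier(w) != NON_OFFENSIVE for w in sentence.split())


if __name__ == "__main__":
    cleaned, labels = filter_sentence(
        "you are a curseword today", toy_sentence_classifier, toy_word_classifier
    )
    print(cleaned)
    print(labels)
```

Passing the classifiers in as callables keeps the cascade independent of the language-specific models, so the same function could serve Sinhala, Tamil, and English with different models plugged in.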
