Abstract

Microblog posts such as tweets frequently contain users’ opinions and thoughts about events, products, people, institutions, etc. However, the use of social media to propagate hate speech is not an uncommon occurrence. Analyzing hateful speech in social media is essential for understanding, fighting and discouraging such actions. We believe that by extracting fragments of text that are semantically similar it is possible to reveal recurrent linguistic patterns in certain kinds of discourse. Therefore, we aim to use these patterns to encapsulate frequent statements textually expressed in microblog posts. In this paper, we propose to exploit such linguistic patterns in the context of hate speech. Through a technique that we call SSP (Short Semantic Pattern) mining, we are able to extract sequences of words that share a similar meaning in their word embedding representation. By analyzing the extracted patterns, we reveal kinds of discourse that are replayed across a dataset, such as racist and sexist statements. Afterwards, we experiment with using SSPs as features to build classifiers that detect whether a tweet contains hate speech (binary classification) and that distinguish between sexist, racist and clean tweets (ternary classification). The SSP instances found in sexist tweets show that a large number of them begin with the introductions ‘I’m not sexist but’ and ‘Call me sexist but’. Meanwhile, SSP instances found in racist tweets reveal a prominence of content directed against the Islamic religion and its associated entities and organizations.
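To illustrate the general idea behind this kind of pattern mining, the following is a minimal sketch: word n-grams are represented by the mean of their word embeddings, and n-grams whose vectors are close in cosine space are grouped as candidate recurring patterns. The toy embeddings, the example tweets, the trigram window and the 0.9 similarity threshold are illustrative assumptions only, not the actual SSP algorithm or parameters used in the paper.

```python
# Illustrative sketch (not the paper's SSP algorithm): group word trigrams
# whose averaged word-embedding vectors are nearly parallel in cosine space.
import numpy as np
from itertools import combinations

# Toy 4-dimensional vectors standing in for pretrained word embeddings.
EMB = {
    "call":   np.array([0.9, 0.1, 0.0, 0.1]),
    "me":     np.array([0.1, 0.8, 0.1, 0.0]),
    "sexist": np.array([0.0, 0.1, 0.9, 0.2]),
    "racist": np.array([0.0, 0.2, 0.8, 0.3]),
    "but":    np.array([0.2, 0.1, 0.1, 0.9]),
    "i'm":    np.array([0.8, 0.2, 0.1, 0.1]),
    "not":    np.array([0.1, 0.7, 0.2, 0.1]),
}

def ngrams(tokens, n=3):
    """Return all word n-grams of length n from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def embed(ngram):
    """Represent an n-gram as the mean of its word vectors (skip OOV words)."""
    vecs = [EMB[w] for w in ngram if w in EMB]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

tweets = [
    "call me sexist but ...",
    "call me racist but ...",
    "i'm not sexist but ...",
]

# Collect candidate patterns (trigrams) from every tweet.
candidates = []
for tweet in tweets:
    for gram in ngrams(tweet.lower().split(), n=3):
        vec = embed(gram)
        if vec is not None:
            candidates.append((gram, vec))

# Pairs of distinct n-grams with nearly identical mean embeddings hint at a
# recurring pattern expressed with slightly different wording.
for (g1, v1), (g2, v2) in combinations(candidates, 2):
    if g1 != g2 and cosine(v1, v2) > 0.9:
        print(f"similar patterns: {g1} ~ {g2}")
```

In a setting like the one described above, the groups produced this way could then be turned into feature vectors (e.g. one binary feature per pattern group) and fed to an off-the-shelf classifier for the binary and ternary tasks.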
