Abstract

Cyberbullying is bullying carried out with technology as the medium. The exponential growth of social media has exposed many people, especially young people, to the risk of becoming targets of cyberbullying. Machine learning can pick up language patterns in cyberbullying posts, which makes it possible to build a model that detects such content automatically. This project uses data from Formspring.me, a question-and-answer website, labeled through Amazon’s Mechanical Turk web service. After data preparation and pre-processing, Scikit-learn, a Python library, is used to train a model that classifies whether cyberbullying is present in a post, using the insult words it contains as features. Four machine learning techniques are used to train the model: logistic regression, decision tree, random forest, and support vector machine. All four algorithms achieve more than 78% accuracy in identifying true positives, and the support vector machine performs best at 87%. Furthermore, the context of cyberbullying posts is investigated and divided into five categories: swearing, abusive, sexism, vulgar, and racism. Lastly, a study of the severity level and context of cyberbullying content finds that swearing is the most frequent category of cyberbullying in this dataset.

Keywords: cyberbully; machine learning; social media; natural language processing
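The feature and scoring ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the lexicon `INSULT_WORDS`, the helper names `insult_features` and `true_positive_rate`, the toy posts, and the one-line threshold rule standing in for a trained classifier are all hypothetical. It also assumes that "accuracy in identifying the true positive" means the true positive rate (recall on the bullying class), which the abstract does not state explicitly.

```python
import re

# Hypothetical placeholder lexicon; the actual insult-word list is not given in the abstract.
INSULT_WORDS = {"stupid", "idiot", "loser", "ugly"}


def insult_features(post):
    """Return [insult-word count, insult-word share of tokens] for one post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    hits = sum(1 for token in tokens if token in INSULT_WORDS)
    share = hits / len(tokens) if tokens else 0.0
    return [hits, share]


def true_positive_rate(y_true, y_pred):
    """Fraction of actual bullying posts (label 1) that were predicted as bullying."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy posts with made-up labels (1 = bullying, 0 = not bullying).
posts = [
    "you are such a stupid loser",
    "what is your favourite movie?",
    "nobody likes you, idiot",
    "you should just give up",
]
labels = [1, 0, 1, 1]

features = [insult_features(post) for post in posts]

# Stand-in for a trained classifier: flag any post containing an insult word.
predictions = [1 if row[0] > 0 else 0 for row in features]

tpr = true_positive_rate(labels, predictions)
print(features)
print(predictions, round(tpr, 3))
```

The last toy post is hostile but contains no lexicon word, so the stand-in rule misses it; this shows the kind of case a purely insult-word feature set cannot capture. In a Scikit-learn workflow, rows like `features` would presumably be passed to estimators such as `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier`, and `SVC` in place of the threshold rule.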

