Characteristics of Multi-Class Suicide Risks Tweets Through Feature Extraction and Machine Learning Techniques

Yan Qian Lim, Yim Ling Loo

doi:10.30630/joiv.7.4.2284

Abstract

This paper analyzes the linguistic characteristics associated with specific levels of suicide risk and examines how feature extraction techniques affect the effectiveness of predictive models of suicide ideation. Many studies have addressed the detection of suicide ideation in social media posts through feature extraction and machine learning, but few have addressed the multiclass classification of suicide risk or analyzed how linguistic characteristics affect predictability. To address this gap, this paper proposes a machine learning framework that performs multiclass classification of suicide risk from social media posts, with an extended analysis of the linguistic characteristics that contribute to suicide risk detection. A dataset of 552 Twitter posts was manually annotated for supervised suicide risk modeling. Features were extracted with a combination of term frequency-inverse document frequency (TF-IDF), Part-of-Speech (PoS) tagging, and the Valence Aware Dictionary and sEntiment Reasoner (VADER). The model was trained with the Random Forest technique. Testing on 138 samples, with scenarios of detection on real-time data, yielded 86.23% accuracy, 86.71% precision, and 86.23% recall; the improvement came from combining feature extraction techniques rather than from the data modeling technique. The extended analysis showed that a sentence's context is the main contributor to suicide risk classification accuracy, while grammatical tags and strong conclusive terms were not.
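The reported figures can be sanity-checked with a little arithmetic. The sketch below uses only the sample counts and accuracy quoted in the abstract; the confusion matrix, the number of classes, and the helper names are hypothetical and serve only to illustrate the calculation. The abstract does not state how precision and recall were averaged across classes. Identical accuracy and recall values are consistent with support-weighted averaging, under which recall always equals accuracy, but that reading is an assumption.

```python
from fractions import Fraction

TOTAL_SAMPLES = 552   # annotated tweets, from the abstract
TEST_SAMPLES = 138    # test samples, from the abstract
REPORTED_ACCURACY = 0.8623

# Share of the annotated dataset used for testing.
test_share = Fraction(TEST_SAMPLES, TOTAL_SAMPLES)

# Whole number of correct predictions closest to the reported accuracy.
correct = round(REPORTED_ACCURACY * TEST_SAMPLES)
implied_accuracy = correct / TEST_SAMPLES


def accuracy(confusion):
    """Fraction of samples on the diagonal (rows = true class)."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def weighted_recall(confusion):
    """Per-class recall averaged with class support as the weight."""
    total = sum(sum(row) for row in confusion)
    result = 0.0
    for i, row in enumerate(confusion):
        support = sum(row)
        if support:
            result += (support / total) * (row[i] / support)
    return result


# Hypothetical three-class confusion matrix: 138 samples, 119 correct.
# The per-class counts are invented for illustration, not taken from the paper.
example = [
    [50, 4, 2],
    [5, 40, 3],
    [2, 3, 29],
]

print(f"test share: {test_share}")
print(f"correct predictions: {correct} of {TEST_SAMPLES}")
print(f"implied accuracy: {implied_accuracy:.2%}")
print(f"weighted recall on example: {weighted_recall(example):.2%}")
```

Under these assumptions, the test set is one quarter of the annotated data, and the reported accuracy corresponds to a whole number of correctly classified tweets.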
