Enhanced Word Embedding Variations for the Detection of Substance Abuse and Mental Health Issues on Social Media Writings

Diana Ramirez-Cifuentes, Ana Freire, Ricardo Baeza-Yates, Christine Largeron, Julien Tissier

doi:10.1109/access.2021.3112102

Abstract

Substance abuse and mental health issues are severe conditions that affect millions of people. Signs of certain conditions have been traced on social media through the analysis of posts. In this paper we analyze textual cues that characterize and differentiate Reddit posts related to depression, eating disorders, suicidal ideation, and alcoholism, along with control posts. We also generate enhanced word embeddings for binary and multi-class classification tasks dedicated to the detection of these types of posts. Our enhancement method focuses on identifying terms that are predictive for a class and aims to move their vector representations close to each other while moving them away from the vectors of terms that are predictive for other classes. Variations of the embeddings are defined and evaluated through predictive tasks, a cosine similarity-based method, and a visual approach. We generate predictive models using variations of our enhanced representations with statistical and deep learning approaches. We also propose a method that leverages the properties of the enhanced embeddings in order to build features for predictive models. Results show that variations of our enhanced representations outperform, in Recall, Accuracy, and F1-Score, the embeddings learned with Word2vec, DistilBERT, GloVe’s fine-tuned pre-learned embeddings, and other methods based on domain-adapted embeddings. The approach presented has the potential to be used on similar binary or multi-class classification tasks that deal with small domain-specific textual corpora.
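To make the idea of a cosine similarity-based check concrete, the following is a minimal sketch and not the evaluation procedure used in the paper. It assumes that terms predictive for a class have already been identified, and it compares the mean cosine similarity within one group of terms against the mean similarity across two groups. The term names (`a1`, `b1`, ...), the two-dimensional vectors, and the helper `mean_pairwise_cosine` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(terms_x, terms_y, embeddings):
    """Mean cosine similarity over all pairs (x, y) with x != y."""
    sims = [
        cosine(embeddings[x], embeddings[y])
        for x in terms_x
        for y in terms_y
        if x != y
    ]
    return sum(sims) / len(sims)


# Hypothetical toy embeddings: a1/a2 stand for terms predictive of one class,
# b1/b2 for terms predictive of another class.
toy_embeddings = {
    "a1": [1.0, 0.1],
    "a2": [0.9, 0.2],
    "b1": [0.1, 1.0],
    "b2": [0.2, 0.9],
}
class_a = ["a1", "a2"]
class_b = ["b1", "b2"]

within_a = mean_pairwise_cosine(class_a, class_a, toy_embeddings)
across_ab = mean_pairwise_cosine(class_a, class_b, toy_embeddings)
print(f"within class A: {within_a:.3f}, across A and B: {across_ab:.3f}")
```

Under these assumptions, an embedding space that matches the stated goal would show a higher within-class similarity than cross-class similarity for the predictive terms.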

Highlights

Substance abuse and mental disorders are serious conditions that impact people’s thinking, mood, feelings, and behavior.
Our main goals are twofold: first, to identify textual elements that characterize each of the conditions analyzed and that distinguish these conditions from each other, including elements that differentiate mental conditions in general (MEN) from control cases (CON); second, to define automated methods capable of detecting posts related to each of the conditions addressed, through the introduction of a word embedding generation model that identifies and takes advantage of the terms that are most used in the posts of users presenting a given condition.
For each plot we report PCA’s Total Explained Variance Percentage (TEVP), which indicates the percentage of information retained by the two resulting components and is given by the sum of the Explained Variance Ratios of those components.
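As an illustration of how a TEVP value can be obtained, here is a minimal sketch that assumes a plain PCA on mean-centered vectors, computed with NumPy's SVD. The function name `total_explained_variance_pct` and the random toy data are hypothetical and do not come from the paper.

```python
import numpy as np


def total_explained_variance_pct(vectors, n_components=2):
    """Percentage of total variance retained by the first PCA components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance explained by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    explained_variance_ratio = variances / variances.sum()
    # TEVP: sum of the ratios of the retained components, as a percentage.
    return 100.0 * explained_variance_ratio[:n_components].sum()


rng = np.random.default_rng(0)

# Hypothetical stand-in for word vectors: 50 "terms" in 10 dimensions.
toy_vectors = rng.normal(size=(50, 10))
tevp = total_explained_variance_pct(toy_vectors)

# Data that lies in a 2-dimensional subspace is fully captured by 2 components.
planar_vectors = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
tevp_planar = total_explained_variance_pct(planar_vectors)

print(f"TEVP (random data): {tevp:.1f}%")
print(f"TEVP (planar data): {tevp_planar:.1f}%")
```

A low TEVP signals that a two-dimensional plot discards much of the structure of the original vectors, so distances in the plot should be read with caution.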

Summary

Introduction

Substance abuse and mental disorders are serious conditions that impact people’s thinking, mood, feelings, and behavior. These conditions can affect a person's daily activities and the way they relate to others. Our main goals are twofold: first, to identify textual elements that characterize each of the conditions analyzed and that distinguish these conditions from each other, including elements that differentiate mental conditions in general (MEN) from control cases (CON); second, to define automated methods capable of detecting posts related to each of the conditions addressed, through the introduction of a word embedding generation model that identifies and takes advantage of the terms that are most used in the posts of users presenting a given condition. We built and evaluated different predictive models that allow us to compare our enhanced embeddings with embeddings generated by other methods, including domain adaptation approaches.

Objectives

Results

Conclusion