BackgroundStigma surrounding substance use can result in severe consequences for physical and mental health. Identifying situations in which stigma occurs and characterizing its impact could be a critical step toward improving outcomes for individuals experiencing stigma. As part of a larger research project with the goal of informing the development of interventions for substance use disorder, this study leverages natural language processing methods and a theory-informed approach to identify and characterize manifestations of substance use stigma in social media data.MethodsWe harvested social media data, creating an annotated corpus of 2,214 Reddit posts from subreddits relating to substance use. We trained a set of binary classifiers; each classifier detected one of three stigma types: Internalized Stigma, Anticipated Stigma, and Enacted Stigma, from the Stigma Framework. We evaluated hybrid models that combine contextual embeddings with features derived from extant lexicons and handcrafted lexicons based on stigma theory, and assessed the performance of these models. Then, using the trained and evaluated classifiers, we performed a mixed-methods analysis to quantify the presence and type of stigma in a corpus of 161,448 unprocessed posts derived from subreddits relating to substance use.ResultsFor all stigma types, we identified hybrid models (RoBERTa combined with handcrafted stigma features) that significantly outperformed RoBERTa-only baselines. In the model’s predictions on our unseen data, we observed that Internalized Stigma was the most prevalent stigma type for alcohol and cannabis, but in the case of opioids, Anticipated Stigma was the most frequent. Feature analysis indicated that language conveying Internalized Stigma was predominantly characterized by emotional content, with a focus on shame, self-blame, and despair. In contrast, Enacted Stigma and Anticipated involved a complex interplay of emotional, social, and behavioral features.ConclusionOur main contributions are demonstrating a theory-based approach to extracting and comparing different types of stigma in a social media dataset, and employing patterns in word usage to explore and characterize its manifestations. The insights from this study highlight the need to consider the impacts of stigma differently by mechanism (internalized, anticipated, and enacted), and enhance our current understandings of how each stigma mechanism manifests within language in particular cognitive, emotional, social, and behavioral aspects.