Using large language models for extracting and pre-annotating texts on mental health from noisy data in a low-resource language.

Sergei Koltcov, Anton Surkov, Olessia Koltsova, Vera Ignatenko

doi:10.7717/peerj-cs.2395

Abstract

Recent advancements in large language models (LLMs) have opened new possibilities for developing conversational agents (CAs) in various subfields of mental healthcare. However, this progress is hindered by limited access to high-quality training data, often due to privacy concerns and high annotation costs for low-resource languages. A potential solution is to create human-AI annotation systems that utilize extensive public domain user-to-user and user-to-professional discussions on social media. These discussions, however, are extremely noisy, necessitating the adaptation of LLMs for fully automatic cleaning and pre-classification to reduce human annotation effort. To date, research on LLM-based annotation in the mental health domain is extremely scarce. In this article, we explore the potential of zero-shot classification using four LLMs to select and pre-classify texts into topics representing psychiatric disorders, in order to facilitate the future development of CAs for disorder-specific counseling. We use 64,404 Russian-language texts from online discussion threads labeled with the seven most commonly discussed disorders: depression, neurosis, paranoia, anxiety disorder, bipolar disorder, obsessive-compulsive disorder, and borderline personality disorder. Our research shows that while preliminary data filtering using zero-shot classification slightly improves classification, LLM fine-tuning makes a far larger contribution to its quality. Both standard and natural language inference (NLI) modes of fine-tuning more than triple classification accuracy compared with non-fine-tuned models applied to preliminarily filtered data. Although NLI fine-tuning achieves slightly higher accuracy (0.64) than the standard approach, it is six times slower, indicating a need for further experimentation with NLI hypothesis engineering. Additionally, we demonstrate that lemmatization does not affect classification quality and that multilingual models using texts in their original language perform slightly better than English-only models using automatically translated texts. Finally, we introduce our dataset and model as the first openly available Russian-language resource for developing conversational agents in the domain of mental health counseling.
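To make the NLI mode mentioned in the abstract more concrete, the following is a minimal sketch of how zero-shot classification is commonly framed as natural language inference: each text is paired with one hypothesis per candidate label, and the label whose hypothesis receives the highest entailment score wins. The hypothesis template, the threshold-based noise filter, and the `toy_entail_score` function are illustrative assumptions and are not taken from the article; in practice the scorer would be an NLI model.

```python
from typing import Callable, List, Optional, Tuple

# The seven disorder labels named in the abstract.
LABELS: List[str] = [
    "depression",
    "neurosis",
    "paranoia",
    "anxiety disorder",
    "bipolar disorder",
    "obsessive-compulsive disorder",
    "borderline personality disorder",
]

# Hypothetical hypothesis template; the abstract does not state the real one.
TEMPLATE = "This text is about {label}."


def build_nli_pairs(text: str, labels: List[str] = LABELS) -> List[Tuple[str, str]]:
    """Expand one text into one (premise, hypothesis) pair per candidate label."""
    return [(text, TEMPLATE.format(label=label)) for label in labels]


def classify(
    text: str,
    entail_score: Callable[[str, str], float],
    labels: List[str] = LABELS,
    threshold: float = 0.0,
) -> Tuple[Optional[str], float]:
    """Return (best label, score); the label is None if no score exceeds the threshold.

    The threshold acts as an assumed noise filter for off-topic texts.
    """
    scores = [entail_score(p, h) for p, h in build_nli_pairs(text, labels)]
    best = max(range(len(labels)), key=scores.__getitem__)
    if scores[best] <= threshold:
        return None, scores[best]
    return labels[best], scores[best]


def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer, NOT a real NLI model: share of label words found in the premise."""
    prefix, suffix = TEMPLATE.split("{label}")
    label = hypothesis[len(prefix):len(hypothesis) - len(suffix)]
    words = label.lower().split()
    hits = sum(1 for word in words if word in premise.lower())
    return hits / len(words)


if __name__ == "__main__":
    print(classify("I was diagnosed with bipolar disorder last year", toy_entail_score))
    print(classify("Selling a used bicycle, good condition", toy_entail_score))
```

Because every text is scored once per candidate label, the number of model calls grows with the size of the label set, which is one plausible reason an NLI-style setup runs slower than a single-pass classifier.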

Journal: PeerJ Computer Science
Publication Date: Jan 1, 2024
License type: CC BY
