HealthE: Recognizing Health Advice &amp; Entities in Online Health Communities

Joseph Gatto,Madhusudan Basak,Parker Seegmiller,Sarah Masud Preum,Garrett M Johnston

doi:10.1609/icwsm.v17i1.22210

Abstract

The task of extracting and classifying entities is at the core of important Health-NLP systems such as misinformation detection, medical dialogue modeling, and patient-centric information tools. Granular knowledge of textual entities allows these systems to utilize knowledge bases, retrieve relevant information, and build graphical representations of texts. Unfortunately, most existing works on health entity recognition are trained on clinical notes, which are both lexically and semantically different from public health information found in online health resources or social media. In other words, existing health entity recognizers vastly under-represent the entities relevant to public health data, such as those provided by sites like WebMD. It is crucial that future Health-NLP systems be able to model such information, as people rely on online health advice for personal health management and clinically relevant decision making. In this work, we release a new annotated dataset, HealthE, which facilitates the large-scale analysis of online textual health advice. HealthE consists of 3,400 health advice statements with token-level entity annotations. Additionally, we release 2,256 health statements which are not health advice to facilitate health advice mining. HealthE is the first dataset with an entity-recognition label space designed for the modeling of online health advice. We motivate the need for HealthE by demonstrating the limitations of five widely-used health entity recognizers on HealthE, such as those offered by Google and Amazon. We additionally benchmark three pre-trained language models on our dataset as reference for future research. All data is made publicly available.

Full Text