A system for de-identifying medical message board text

Adrian Benton, Shawndra Hill, John H Holmes, Cristin Freeman, Charles Leonard, Annie Chung, Lyle Ungar

doi:10.1186/1471-2105-12-s3-s2

Abstract

There are millions of public posts to medical message boards by users seeking support and information on a wide range of medical conditions. It has been shown that these posts can be used to gain a greater understanding of patients’ experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or to a particular message board, and mentions of the authors’ personal lives. The main contribution of this paper is a system that automatically de-identifies the authors of message board posts while taking these challenges into account. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.

Highlights

Medical message boards (MMBs) serve as forums for emotional support and information exchange, usually for patients with similar conditions.
Users of MMBs communicate by asynchronously posting messages to the board in threads, groups of messages that are typically centered on a single topic.
In order to improve and validate our system, we created a development set with 500 messages sampled from the breast cancer (BC) corpus (31,232 non-punctuation tokens, 483 names total) distinct from the set on which our classifier was trained, and a test set with 500 messages sampled from the arthritis corpus (28,146 non-punctuation tokens, 432 names total).

Summary

Introduction

Medical message boards (MMBs) serve as forums for emotional support and information exchange, usually for patients with similar conditions. Because of the sheer number, low cost, and candid nature of messages posted on these boards, many researchers have begun to treat MMB threads as “virtual focus groups” to gain more knowledge about patient experiences [1,2,3]. As more patients gain access to the Internet and join these communities, more MMB text on patient experiences will become available, providing researchers with [...] sample of 500 posts resulted in correctly identifying 81.2% of proper names with a precision of 61.7%. This does not take into account any usernames that were present in these documents. We frame the task of de-identifying MMB text as a specialized form of named entity recognition (NER).
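The recall (share of proper names correctly identified) and precision figures quoted above are the usual way to score name detection, and they are often combined into a single F-measure. The following is a minimal sketch of how such scores could be computed at the token level. It is not the authors' evaluation code: the function names `precision_recall_f1` and `f_measure`, the use of token positions as identifiers, and the example sentence are all assumptions made for illustration.

```python
from typing import Set, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Score predicted name tokens against hand-annotated (gold) name tokens.

    Both arguments are sets of token positions (hypothetical representation).
    """
    true_positives = len(gold & predicted)
    # Precision: of the tokens flagged as names, how many really are names.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of the real name tokens, how many were flagged.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Illustrative post (invented): positions 1 and 5 are names in the gold annotation.
tokens = ["thanks", "mary", "my", "doctor", "is", "smith", "hope", "this", "helps"]
gold_names = {1, 5}
predicted_names = {1, 8}  # a hypothetical tagger finds "mary" but wrongly flags "helps"

p, r, f1 = precision_recall_f1(gold_names, predicted_names)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Combining the two figures quoted in the text (recall 81.2%, precision 61.7%).
print(f"F-measure for the quoted figures: {f_measure(0.617, 0.812):.3f}")
```

For de-identification, recall is usually the more important of the two, since a missed name leaks identifying information while a false positive only removes a harmless word.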



Journal: BMC Bioinformatics	Publication Date: Jun 9, 2011
Citations: 39	License type: CC BY 2.0


