Abstract

Users seeking support and information on a wide range of medical conditions have written millions of public posts to medical message boards. It has been shown that these posts can be used to gain a greater understanding of patients’ experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or a particular message board, and mentions of the authors’ personal lives. The main contribution of this paper is a system to de-identify the authors of message board posts automatically, taking into account the aforementioned challenges. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.

Highlights

  • Medical message boards (MMBs) serve as forums for emotional support and information exchange, usually for patients with similar conditions

  • Users of MMBs communicate by asynchronously posting messages to the board in threads, groups of messages that are typically centered on a single topic

  • In order to improve and validate our system, we created a development set with 500 messages sampled from the breast cancer (BC) corpus (31,232 non-punctuation tokens, 483 names total) distinct from the set on which our classifier was trained, and a test set with 500 messages sampled from the arthritis corpus (28,146 non-punctuation tokens, 432 names total)

Summary

Introduction

Medical message boards (MMBs) serve as forums for emotional support and information exchange, usually for patients with similar conditions. Because of the sheer number, inexpensiveness, and candid nature of messages posted on these boards, many researchers have begun to treat MMB threads as “virtual focus groups” to gain more knowledge about patient experiences [1,2,3]. As more patients gain access to the Internet and join these communities, more MMB text on patient experiences will become available, providing researchers with [...] sample of 500 posts resulted in correctly identifying 81.2% of proper names with a precision of 61.7%. This does not take into account any usernames that were present in these documents. We frame the task of de-identifying MMB text as a specialized form of named entity recognition (NER).
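The two figures quoted above (81.2% of proper names identified, at a precision of 61.7%) can be combined into a single F1 score for easier comparison between systems. The sketch below assumes that "correctly identifying 81.2% of proper names" is a recall figure; the function name `f1_score` is ours, and the resulting F1 value is derived arithmetic, not a number reported in this excerpt.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumption: 81.2% is recall and 61.7% is precision, as quoted in the text.
quoted_f1 = f1_score(precision=0.617, recall=0.812)
print(f"F1 = {quoted_f1:.3f}")  # prints "F1 = 0.701"
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, so the modest precision pulls the combined score well below the recall.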

