Abstract

We report on the procedure for developing a large, representative corpus of 50,000 sentences taken from clinical notes. Previously reported annotated corpora of clinical notes have been small and do not represent the whole domain of clinical notes. The sentences in this corpus were selected from a very large raw corpus of ten thousand documents, themselves sampled from an internal repository of more than 700,000 documents from multiple health care providers. Each document is de-identified to remove any protected health information (PHI). Using the Penn Treebank tagging guidelines with minor modifications, we annotate this corpus manually, with an average inter-annotator agreement of more than 98%. The goal is to create a part-of-speech annotated corpus in the clinical domain that is comparable to the Penn Treebank and that also represents the full range of contemporary text used in the clinical domain. We also report the output of the TnT tagger trained on the initial 21,000 annotated sentences, which reaches a preliminary accuracy of above 96%.
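The abstract does not state how inter-annotator agreement was computed. One common reading is simple token-level agreement averaged over annotator pairs, and the minimal sketch below assumes that reading. The function names, annotator labels, and example tags are hypothetical and are not taken from the corpus.

```python
from itertools import combinations


def token_agreement(tags_a, tags_b):
    """Fraction of tokens to which two annotators assigned the same tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def average_pairwise_agreement(annotations):
    """Mean token agreement over every pair of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(token_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical Penn Treebank style tags for "Patient denies chest pain ."
# (assumed data, for illustration only)
annotations = {
    "annotator_1": ["NN", "VBZ", "NN", "NN", "."],
    "annotator_2": ["NN", "VBZ", "NN", "NN", "."],
    "annotator_3": ["NN", "VBZ", "JJ", "NN", "."],
}
score = average_pairwise_agreement(annotations)
print(f"average pairwise agreement: {score:.4f}")
```

The same `token_agreement` function can also serve as a token-level accuracy measure when one sequence is a tagger's output and the other is the gold annotation. Chance-corrected measures such as Cohen's kappa are an alternative that this sketch does not cover.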

Highlights

  • Automated part-of-speech (PoS) annotation has been an active field of research for more than 40 years

  • Quite a few tools are already available with impressive accuracy (Toutanova et al, 2003; Shen et al, 2007; Spoustova et al, 2009; Søgaard, 2010). This holds for general-domain text such as news reports or general-interest articles. For a niche area such as the clinical domain, no automated part-of-speech taggers are readily available, nor has any large corpus been reported that meets the standards set in the general domain

  • The existing corpora are small; extensive analysis has been done on them


Summary

Introduction

Automated part-of-speech (PoS) annotation has been an active field of research for more than 40 years. Quite a few tools are already available with impressive accuracy (Toutanova et al, 2003; Shen et al, 2007; Spoustova et al, 2009; Søgaard, 2010). This holds for general-domain text such as news reports or general-interest articles. There have been previous attempts to create PoS-annotated corpora in the clinical domain (Tateisi et al, 2004; Pakhomov et al, 2006; Albright et al, 2013). All of these corpora are relatively small, and the PoS taggers trained on them have not been shown to reach accuracy above 96% in the clinical domain. Neither of these two adaptation methods raises accuracy above 95%.

