Abstract

Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.

Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted, and the results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores for the former were higher than those for the latter by 0.9 to 2.6% on the i2b2/VA corpus and by 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy.

Conclusions: The current results suggest that detection of concept phrases could be improved by tackling multiple concept types simultaneously. They also suggest that multiple concept types should be annotated when developing a new corpus for machine learning models. Further investigation is expected to yield insight into the underlying mechanism by which good performance is achieved when multiple concept types are considered.
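To make the contrast between the two strategies concrete, the following minimal sketch (not taken from the paper; token-level annotations and i2b2/VA-style type names are assumed purely for illustration) shows how the same annotated sentence yields a single multi-type BIO tag set under the all-types-at-once strategy and a reduced, binary tag set under the one-type-at-a-time strategy.

```python
# Illustrative sketch only: how the two strategies differ in the tag sets an
# HMM tagger would be trained on, assuming BIO-style encoding of annotations.
from typing import List, Tuple, Optional

# A sentence with token-level concept annotations: (token, concept type or None).
# "problem" and "treatment" are i2b2/VA-style type names used here as examples.
ANNOTATED = [
    ("patient", None), ("denies", None),
    ("chest", "problem"), ("pain", "problem"),
    ("after", None), ("aspirin", "treatment"),
]

def bio_all_types(tokens: List[Tuple[str, Optional[str]]]) -> List[str]:
    """All-types-at-once: one tag set with a B-/I- pair per concept type."""
    tags, prev = [], None
    for _, ctype in tokens:
        if ctype is None:
            tags.append("O")
        elif ctype == prev:
            tags.append(f"I-{ctype}")
        else:
            tags.append(f"B-{ctype}")
        prev = ctype
    return tags

def bio_one_type(tokens: List[Tuple[str, Optional[str]]], target: str) -> List[str]:
    """One-type-at-a-time: a binary tag set; all other types collapse to O."""
    reduced = [(tok, ctype if ctype == target else None) for tok, ctype in tokens]
    return bio_all_types(reduced)

print(bio_all_types(ANNOTATED))
# ['O', 'O', 'B-problem', 'I-problem', 'O', 'B-treatment']
print(bio_one_type(ANNOTATED, "treatment"))
# ['O', 'O', 'O', 'O', 'O', 'B-treatment']
```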

Highlights

  • Identifying phrases that refer to particular concept types is a critical step in extracting information from documents

  • Concept mention detection is the task of identifying phrases in documents that refer to particular concept types

  • In this study, we compared the all-types-at-once and one-type-at-a-time strategies in applying Hidden Markov Model (HMM) taggers to a clinical corpus released in the 2010 i2b2/VA natural language processing (NLP) challenge workshop and a biological literature corpus released in the JNLPBA workshop



Introduction

Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with documents annotated with concept phrases as training data, supervised machine learning can be used to automate concept mention detection. In the clinical domain, annotated clinical notes have recently been released to the research community through pioneering efforts [3,4]. These annotated data sets have promoted the application of machine learning to clinical text. When the detection task involves two or more target concept types, there is an option to build one machine learning model for all types (all-types-at-once strategy) or to build multiple models, each tackling one type (one-type-at-a-time strategy). The former strategy may have an advantage in exploiting dependency among concept types. With ongoing efforts on corpus development in the clinical domain, we believe this is a timely question to pose.
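As a further illustration of this modelling choice, the sketch below contrasts training one HMM tagger over the full tag set with training a separate binary tagger per concept type. This is not the authors' implementation: NLTK's HiddenMarkovModelTrainer is used only as a stand-in HMM toolkit, and train_sents is assumed to be a list of BIO-tagged sentences produced by some corpus reader, as in the earlier sketch.

```python
# Sketch of the two training strategies for an HMM tagger (illustrative only;
# the paper does not specify a toolkit, and the corpus reader is assumed).
from nltk.tag import hmm

CONCEPT_TYPES = ["problem", "test", "treatment"]  # i2b2/VA concept types

def train_all_types(train_sents):
    """All-types-at-once: one model over the full multi-type tag set.

    train_sents: list of sentences, each a list of (token, BIO-tag) pairs.
    """
    return hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)

def train_one_type_each(train_sents):
    """One-type-at-a-time: a separate binary model per concept type."""
    models = {}
    for ctype in CONCEPT_TYPES:
        # Collapse tags of all other concept types to O before training.
        reduced = [
            [(tok, tag if tag.endswith(ctype) else "O") for tok, tag in sent]
            for sent in train_sents
        ]
        models[ctype] = hmm.HiddenMarkovModelTrainer().train_supervised(reduced)
    return models
```

Under the one-type-at-a-time strategy, each per-type model tags the test sentences independently and the predicted mentions are merged afterwards, whereas the all-types-at-once model produces all mention types in a single decoding pass.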


