A common type system for clinical natural language processing

Stephen T Wu, Wendy W Chapman, Hongfang Liu, Guergana K Savova, Lee Becker, James J Masanz, Christopher G Chute, Pei Chen, Dmitriy Dligach, Vinod C Kaggal

doi:10.1186/2041-1480-4-1

Abstract

Background: One challenge in reusing clinical data stored in electronic medical records is that these data are heterogeneous. Clinical Natural Language Processing (NLP) plays an important role in transforming information in clinical text to a standard representation that is comparable and interoperable. Information may be processed and shared when a type system specifies the allowable data structures. Therefore, we aim to define a common type system for clinical NLP that enables interoperability between structured and unstructured data generated in different clinical settings.

Results: We describe a common type system for clinical NLP that has an end target of deep semantics based on Clinical Element Models (CEMs), thus interoperating with structured data and accommodating diverse NLP approaches. The type system has been implemented in UIMA (Unstructured Information Management Architecture) and is fully functional in a popular open-source clinical NLP system, cTAKES (clinical Text Analysis and Knowledge Extraction System), versions 2.0 and later.

Conclusions: We have created a type system that targets deep semantics, thereby allowing NLP systems to encapsulate knowledge from text and share it alongside heterogeneous clinical data sources. Rather than the surface semantics that are typically the end product of NLP algorithms, CEM-based semantics explicitly build in deep clinical semantics as the point of interoperability with more structured data types.

Highlights

One challenge in reusing clinical data stored in electronic medical records is that these data are heterogeneous
Area 4 of the Strategic Healthcare IT Advanced Research Project (SHARP 4, or SHARPn) aims to reuse data from the electronic medical record (EMR), analyzing records on a large scale, an effort known as high-throughput phenotyping
This type system is an extensive update of the Clinical Text Analysis and Knowledge Extraction System (cTAKES) type system, with modifications, restructuring, and additions

Summary

Introduction

One challenge in reusing clinical data stored in electronic medical records is that these data are heterogeneous. We aim to define a common type system for clinical NLP that enables interoperability between structured and unstructured data generated in different clinical settings. Electronic medical records (EMRs) hold immense promise for improving both practice and research. Area 4 of the Strategic Healthcare IT Advanced Research Project (SHARP 4, or SHARPn) aims to reuse data from the EMR, analyzing records on a large scale, an effort known as high-throughput phenotyping. SHARPn uses Clinical Element Models (CEMs) as the standardized format for information aggregation and comparison. This representation is both concrete and specific, yet allows for some of the ambiguity that is inherent in clinicians' explanations of a clinical situation.
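To make the idea of a type system that "specifies the allowable data structures" more concrete, the following is a minimal sketch in Python, not the UIMA or cTAKES type system itself. The class names (`TextSpan`, `MedicationMention`) and all field names are hypothetical and chosen only for illustration; they are not taken from the CEM specifications. The sketch shows a surface-level text span linked to a structured, CEM-like record whose optional and uncertainty fields leave room for the ambiguity found in clinical narrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextSpan:
    """Surface-level annotation: character offsets into the source note."""
    begin: int
    end: int

    def covered_text(self, document: str) -> str:
        # Return the substring of the note that this span points at.
        return document[self.begin:self.end]


@dataclass(frozen=True)
class MedicationMention:
    """Hypothetical CEM-like record; field names are illustrative only."""
    span: TextSpan
    drug_name: str
    dose: Optional[str] = None   # stays None when the note does not state a dose
    negated: bool = False
    uncertain: bool = False      # keeps the clinician's hedging explicit


note = "Patient may be taking aspirin 81 mg daily."
start = note.index("aspirin")
mention = MedicationMention(
    span=TextSpan(start, start + len("aspirin")),
    drug_name="aspirin",
    dose="81 mg",
    uncertain=True,
)
print(mention.span.covered_text(note), mention.dose, mention.uncertain)
```

Because every record must fit a declared type, two systems that agree on the types can exchange such records regardless of whether a record came from free text or from a structured source. That shared contract is the kind of interoperability the type system is meant to support.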

Objectives

Methods

Results

Conclusion