Experiences implementing scalable, containerized, cloud-based NLP for extracting biobank participant phenotypes at scale.

Paul Avillach, Kenneth D Mandl, Timothy A Miller

doi:10.1093/jamiaopen/ooaa016

Open Access

https://doi.org/10.1093/jamiaopen/ooaa016

Journal: JAMIA Open
Publication Date: May 22, 2020
Citations: 5
License type: CC BY 4.0

Affiliation: Boston Children's Hospital, Harvard University

Abstract

Objective: To develop scalable natural language processing (NLP) infrastructure for processing the free text in electronic health records (EHRs).

Materials and Methods: We extend the open-source Apache cTAKES NLP software with several standard technologies for scalability. We remove processing bottlenecks by monitoring component queue size. We process EHR free text for patients in the PrecisionLink Biobank at Boston Children’s Hospital. The extracted concepts are made searchable via a web-based portal.

Results: We processed over 1.2 million notes for over 8000 patients, extracting 154 million concepts. Our largest tested configuration processes over 1 million notes per day.

Discussion: The unique information represented by extracted NLP concepts has great potential to provide a more complete picture of patient status.

Conclusion: NLP on large EHR document collections can be done efficiently, in service of high-throughput phenotyping.
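
The abstract mentions removing bottlenecks by monitoring component queue size but gives no detail. A minimal sketch of that general idea, not the authors' implementation, might flag the pipeline stage whose input queue is longest on average as the likely bottleneck. The function name, stage names, and readings below are all hypothetical and are not taken from cTAKES or the paper.

```python
from statistics import mean


def find_bottleneck(queue_samples):
    """Return the stage whose input queue is, on average, the longest.

    queue_samples maps a stage name to a list of queue-size readings
    taken at regular intervals. Stages with no readings are ignored.
    """
    averages = {
        stage: mean(sizes)
        for stage, sizes in queue_samples.items()
        if sizes
    }
    if not averages:
        raise ValueError("no queue-size samples provided")
    # The stage with the largest average backlog is the one that
    # upstream components are waiting on.
    return max(averages, key=averages.get)


# Hypothetical readings: illustrative stage names and made-up numbers.
samples = {
    "sentence_splitter": [0, 1, 0, 2],
    "concept_lookup": [40, 85, 130, 190],
    "db_writer": [3, 2, 4, 3],
}
print(find_bottleneck(samples))  # prints "concept_lookup"
```

In a setup like this, the flagged stage would be the natural candidate for additional replicas, after which the readings would be collected again to see whether the backlog has moved elsewhere.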

Full Text