Extracting Diagnoses and Investigation Results from Unstructured Text in Electronic Health Records by Semi-Supervised Machine Learning

Zhuoran Wang, Anoop D Shah, John Shawe-Taylor, A Rosemary Tate, Spiros Denaxas, Harry Hemingway, Vladimir Brusic

doi:10.1371/journal.pone.0030412

Open Access

Abstract

Background: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text, which is time-consuming to review manually.

Aim: To develop an algorithm to identify relevant free texts automatically based on labelled examples.

Methods: We developed a novel machine learning algorithm, the ‘Semi-supervised Set Covering Machine’ (S3CM), and tested its ability to detect the presence of coronary angiogram results and ovarian cancer diagnoses in free text in the General Practice Research Database. For training the algorithm, we used texts classified as positive and negative according to their associated Read diagnostic codes, rather than by manual annotation. We evaluated the precision (positive predictive value) and recall (sensitivity) of S3CM in classifying unlabelled texts against the gold standard of manual review. We compared the performance of S3CM with the Transductive Support Vector Machine (TSVM), the original fully-supervised Set Covering Machine (SCM) and our ‘Freetext Matching Algorithm’ natural language processor.

Results: Only 60% of texts with Read codes for angiogram actually contained angiogram results. However, the S3CM algorithm achieved 87% recall with 64% precision on detecting coronary angiogram results, outperforming the fully-supervised SCM (recall 78%, precision 60%) and TSVM (recall 2%, precision 3%). For ovarian cancer diagnoses, S3CM had higher recall than the other algorithms tested (86%). The Freetext Matching Algorithm had better precision than S3CM (85% versus 74%) but lower recall (62%).

Conclusions: Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses, after training on pre-classified test sets. It should be easy to adapt to other disease areas as it does not rely on linguistic rules, but needs further testing in other electronic health record datasets.
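The evaluation measures named in the abstract, precision (positive predictive value) and recall (sensitivity), can be illustrated with a minimal Python sketch. It assumes binary classifier outputs compared against manually reviewed gold-standard labels. The function name `precision_recall` and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], gold: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for binary predictions against gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # True positives: flagged by the classifier and confirmed by manual review
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    # False positives: flagged by the classifier but rejected by manual review
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    # False negatives: missed by the classifier but confirmed by manual review
    fn = sum(1 for p, g in zip(predicted, gold) if g and not p)
    # Return 0.0 when a denominator is empty (an illustrative convention)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data for six free-text entries (not study data)
predicted = [True, True, True, False, False, True]
gold = [True, False, True, True, False, True]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example three of the four flagged texts are true positives and one relevant text is missed, so both measures equal 0.75.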

Highlights

Electronic health records are an important source of data for health research, but much of the information is stored in an unstructured way and can be difficult to extract
Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses, after training on pre-classified test sets
It should be easy to adapt to other disease areas as it does not rely on linguistic rules, but needs further testing in other electronic health record datasets

Summary

Introduction

Electronic health records are an important source of data for health research, but much of the information is stored in an unstructured way and can be difficult to extract. Research to date has predominantly used the coded data because it is readily analysed, but unstructured ‘free’ text in clinical entries may contain important information [1,2,3,4]. Manual review of free text is time-consuming and may require anonymisation to protect patient confidentiality. Medical natural language processing systems such as MedLEE [5] rely on a detailed knowledge base and manually programmed linguistic rules.
