Abstract

Electronic health records (EHRs) contain rich documentation regarding disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction due to its high dimensionality, relative scarcity, and substantial level of noise. We investigated how best to represent EHR data for predicting cervical cancer, a serious disease where early detection improves the outcome of treatment. A case group of 1321 patients with cervical cancer was matched to ten times as many controls, and for both groups several types of events were extracted from their EHRs. These events included clinical codes, lab results, and contents of free text notes retrieved using an LSTM neural network. Clinical events are described with great variation in EHR texts, leading to a very large feature space. Therefore, an event hierarchy inferred from the textual events was created to represent the clinical texts. Overall, the events extracted from free text notes contributed the most to the final prediction, and the hierarchy of textual events further improved performance. Four classifiers were evaluated for predicting a future cancer diagnosis; Random Forest achieved the best results, with an AUC ranging from 0.70 one year before diagnosis to 0.97 one day before diagnosis. We conclude that our approach is sound and showed excellent discrimination at diagnosis, but only modest discrimination before this point. Since the objective of our study was to predict disease earlier than this, we propose that further work consider extending patient histories, for example by integrating primary health care records that precede referral to hospital.
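For readers less familiar with the metric, the AUC values reported above can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as half. The following is a minimal sketch of that definition in Python. It is not the study's evaluation code; the function name `auc` and the toy scores and labels are illustrative assumptions.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a case, 0 for a control (hypothetical encoding).
    scores: classifier output, higher meaning "more likely a case".
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0   # case ranked above control
            elif case_score == control_score:
                wins += 0.5   # tie counts as half
    return wins / (len(cases) * len(controls))


# Toy data (invented for illustration): two cases, four controls.
toy_scores = [0.9, 0.4, 0.1, 0.3, 0.5, 0.2]
toy_labels = [1, 1, 0, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(toy_auc)
```

Under this reading, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls. The nested loop is quadratic in the number of patients, so a rank-based implementation would be preferable at the scale of the cohort described here.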

Highlights

  • Information on disease progression documented in electronic health records (EHRs) is a potential source of valuable new knowledge which could lead to improved health care [1,2,3]

  • Since the objective of our study was correct classification based on EHR events with potentially low expected contrast between cases and controls, we opted for this choice to 1) maximize the control information available for training the classifiers, 2) minimize the risk of misclassifying disease status by ensuring, to the best of our knowledge, that no hospital-based female controls were diagnosed with cervical cancer later during the study period, and 3) increase generalizability to a real-life clinical situation where EHRs are available but the disease status of the individuals is not already known

  • Using machine learning to predict cervical cancer from Swedish electronic health records: the models drew on events representing “Biopsy of portio” and the ICD-10 code D06 (Unspecified location of cervical cancer in situ), and these findings indicate that the models learned from relevant, non-spurious information

Introduction

Information on disease progression documented in electronic health records (EHRs) is a potential source of valuable new knowledge which could lead to improved health care [1,2,3]. Since EHR information is derived directly from health care, there is great interest in how best to use this source for real-life applications by way of advanced medical informatics. Applications that can benefit from EHR mining include clinical decision support, adverse event.
