Abstract

Smoking is an important variable for clinical research, but few studies have addressed automatically extracting smoking status from unstructured bilingual electronic health records (EHRs). We aim to develop an algorithm that classifies smoking status from unstructured EHRs using natural language processing (NLP). With acronym replacement and the Python package Soynlp, we normalize 4711 bilingual clinical notes. Each EHR note is classified into one of 4 categories: current smoker, past smoker, never smoker, and unknown. Subsequently, SPPMI (Shifted Positive Pointwise Mutual Information) is used to vectorize the words in the notes. By calculating the cosine similarity between these word vectors, keywords denoting the same smoking status are identified. Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. The extracted keywords are then used to classify the 4 smoking statuses in our bilingual EHRs. Given an identical SVM classifier, the F1 score improves by as much as 1.8% compared to unigram and bigram Bag of Words features. Our study shows the potential of SPPMI for classifying smoking status from bilingual, unstructured EHRs, and our findings show how smoking information can be acquired for clinical practice and research.
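The SPPMI and cosine-similarity step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the standard definition SPPMI(w, c) = max(PMI(w, c) - log k, 0), treats two words as co-occurring when they appear in the same note, and uses invented toy tokens. The names `sppmi_vectors`, `cosine`, and `shift_k` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def sppmi_vectors(docs, shift_k=1.0):
    """Build sparse SPPMI word vectors from tokenized notes.

    Assumption: the co-occurrence context is the whole note.
    """
    pair_counts = Counter()
    for tokens in docs:
        # Count each unordered pair of distinct words once per note,
        # stored in both directions so every word gets a row.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    # Marginal counts derived from the pair counts.
    word_counts = Counter()
    for (a, _), n in pair_counts.items():
        word_counts[a] += n
    total = sum(pair_counts.values())

    vectors = {}
    for (a, b), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[a] * word_counts[b]))
        value = max(pmi - math.log(shift_k), 0.0)  # shift, then clip at zero
        if value > 0:
            vectors.setdefault(a, {})[b] = value
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already-normalized notes (not taken from the study data).
docs = [
    ["current", "smoker", "1ppd"],
    ["ex-smoker", "quit", "2010"],
    ["never", "smoked"],
    ["smoker", "1ppd", "daily"],
    ["ex-smoker", "quit", "years"],
]

vecs = sppmi_vectors(docs)
same_status = cosine(vecs["smoker"], vecs["1ppd"])
different_status = cosine(vecs["smoker"], vecs["quit"])
print(same_status > different_status)
```

In this toy corpus, words that share contexts ("smoker" and "1ppd") end up with a higher cosine similarity than words from different smoking statuses ("smoker" and "quit"). Ranking candidate words by similarity to a seed keyword is one plausible way to collect keywords for each status. Larger values of `shift_k` discard weaker associations.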

Highlights

  • Smoking is a major risk factor in developing coronary artery disease, chronic kidney disease, cancer, and cardiovascular disease (CVD) [1,2]. It is considered a modifiable risk factor for CVDs and other conditions associated with premature death worldwide [3,4,5,6].

  • Our study showed the potential of classifying smoking status from bilingual unstructured electronic health records (EHRs).

  • To the best of our knowledge, this paper is one of the first works to confirm the possibility of extracting meaningful keywords from bilingual unstructured EHRs.


Summary

Introduction

Smoking is a major risk factor in developing coronary artery disease, chronic kidney disease, cancer, and cardiovascular disease (CVD) [1,2]. It is considered a modifiable risk factor for CVDs and other conditions associated with premature death worldwide [3,4,5,6]. Despite the effectiveness and importance of smoking cessation for disease prevention, smoking information is under-utilized and not measured. It is often buried in narrative text rather than recorded in a consistent coded form. Natural language processing (NLP) methods are essential for automatically transforming clinical free text into structured clinical data, which can then be used by machine learning algorithms [8,9,10].

