Background and ObjectiveChronic cough (CC) affects approximately 10% of adults. Many disease states are associated with chronic cough, such as asthma, upper airway cough syndrome, bronchitis, and gastroesophageal reflux disease. The lack of an ICD code specific for chronic cough makes it challenging to identify such patients from electronic health records (EHRs). For clinical and research purposes, computational methods using EHR data are urgently needed to identify chronic cough cases. This research aims to investigate the data representations and deep learning algorithms for chronic cough prediction. MethodsUtilizing real-world EHR data from a large academic healthcare system from October 2005 to September 2015, we investigated Natural Language Representation of the EHR data and systematically evaluated deep learning and traditional machine learning models to predict chronic cough patients. We built these machine learning models using structured data (medication and diagnosis) and unstructured data (clinical notes). ResultsThe sensitivity and specificity of a transformer-based deep learning algorithm, specifically BERT with attention model, was 0.856 and 0.866, respectively, using structured data (medication and diagnosis). Sensitivity and specificity improved to 0.952 and 0.930 when we combined structured data with symptoms extracted from clinical notes. We further found that the attention mechanism of deep learning models can be used to extract important features that drive the prediction decisions. Compared with our previously published rule-based algorithm, the deep learning algorithm can identify more chronic cough patients with structured data. ConclusionsBy applying deep learning models, chronic cough patients can be reliably identified for prospective or retrospective research through medication and diagnosis data, widely available in EHR and electronic claims data, thus improving the generalizability of the patient identification algorithm. Deep learning models can identify chronic cough patients with even higher sensitivity and specificity when structured and unstructured EHR data are utilized. We anticipate language-based data representation and deep learning models developed in this research could also be productively used for other disease prediction and case identification.