Abstract
Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced datasets in which a small subset of the classes dominates. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on the classification of cancer pathology reports and observe a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties arising from limited training data.
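The two-part training loss described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the softmax cross-entropy objective, the function names, and the `kw_weight` parameter are all assumptions, since the abstract says only that the loss has contributions from both raw data and keywords.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mean_loss(batch_logits, labels):
    """Average cross-entropy over a batch of (logits, label) pairs."""
    total = sum(cross_entropy(l, y) for l, y in zip(batch_logits, labels))
    return total / len(labels)

def combined_loss(doc_logits, doc_labels, kw_logits, kw_labels, kw_weight=1.0):
    """Batch loss with one term for documents and one for keywords.

    kw_weight is a hypothetical knob (not from the source) that scales
    the keyword contribution relative to the document contribution.
    """
    doc_term = mean_loss(doc_logits, doc_labels)
    kw_term = mean_loss(kw_logits, kw_labels)
    return doc_term + kw_weight * kw_term
```

In a training loop, `doc_logits` would be the model outputs for the pathology reports in the batch and `kw_logits` the outputs for the keywords passed through the same model, so that gradients from both terms update the same parameters.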
Highlights
The National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) program works with cancer registries to extract key cancer characteristics from healthcare records to create national estimates of cancer incidence.
Motivated by the performance difficulties of deep learning models on rare classes in clinical text classification [3], we propose a strategy for incorporating keywords into model training.
We identified keywords associated with the ICD-O-3 from the “concept names” listed in the Unified Medical Language System (UMLS) [18] concept unique identifiers (CUI) dictionary.
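As a rough sketch of how per-class keyword sets might be assembled from a table of concept names, consider the snippet below. The row layout, the `max_tokens` cutoff, and the toy rows are assumptions made for illustration; they are not an extract of the UMLS dictionary and do not reproduce the authors' selection procedure.

```python
from collections import defaultdict

def build_keyword_sets(rows, max_tokens=3):
    """Group concept names by class label, keeping only short phrases.

    rows: iterable of (class_label, concept_name) pairs (hypothetical layout).
    max_tokens: assumed cutoff for what counts as a "short" keyword.
    """
    keywords = defaultdict(set)
    for label, name in rows:
        # Lowercase and collapse whitespace so duplicates merge.
        phrase = " ".join(name.lower().split())
        if phrase and len(phrase.split()) <= max_tokens:
            keywords[label].add(phrase)
    return dict(keywords)

# Toy rows for illustration only.
toy_rows = [
    ("8140/3", "Adenocarcinoma"),
    ("8140/3", "adenocarcinoma"),
    ("8070/3", "Squamous cell carcinoma"),
    ("8070/3", "A concept name that is far too long to count as a keyword"),
]
keyword_sets = build_keyword_sets(toy_rows)
```

Each resulting set can then be paired with its class label to form the keyword examples used alongside the raw reports during training.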
Summary
The National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) program works with cancer registries to extract key cancer characteristics from healthcare records to create national estimates of cancer incidence. A key step in this process is the extraction of tumor characteristics, including site, subsite, and histology, from electronic pathology reports. The reports provide a rich source of information to track diagnoses, treatments, and outcomes. Data in current registries is primarily in the form of unstructured text, making automatic information extraction difficult [1]. To overcome the challenges associated with unstructured text, previous work has employed deep learning models for document classification with promising results [2]–[4]. Although deep learning approaches have been successful, the class imbalance inherent in registries’ datasets continues to be a key challenge to training robust production models.