BackgroundTraditionally, cases for cohort selection and quality assurance purposes are identified through structured query language (SQL) searches matching specific keywords. Recently, several neural network-based natural language processing (NLP) pipelines have emerged as an accurate alternative/complementary method for case retrieval. MethodsThe diagnosis section of 1000 pathology reports with the terms “colon” and “carcinoma” were retrieved from our laboratory information system through a SQL query. Each of the reports were labeled as either positive or negative, where cases are considered positive if the case was a primary adenocarcinoma of the colon. Negative cases comprised adenocarcinoma from other sites, metastatic adenocarcinomas, benign conditions, rectal cancers, and other cases that do not fit in the primary colonic adenocarcinoma category. The 1000 cases were randomly separated into training, validation, and holdout sets. A convolutional neural network (CNN) model built using Keras (a neural network library) was trained to identify positive cases, and the model was applied to the holdout set to predict the category for each case. ResultsThe CNN model classified 141 out of 149 primary colonic adenocarcinoma cases, and 43 out of 51 negative cases correctly, achieving an accuracy of 92% and area under the ROC curve (AUC) of 0.957. ConclusionTrained convolutional neural network models by itself, or as an adjunct to keyword and pattern-based text extraction methods may be used to search for pathology cases of interest with high accuracy.
Read full abstract