Population-based cancer registries (PBCRs) collect data on all new cancer diagnoses in a defined population. Data are sourced from pathology reports, and PBCRs rely on manual and rule-based solutions to process them. This study presents a state-of-the-art natural language processing (NLP) pipeline built by fine-tuning pretrained language models (LMs). The pipeline is deployed at the British Columbia Cancer Registry (BCCR) to detect reportable tumors from a population-based feed of electronic pathology reports. We fine-tune two publicly available LMs, GatorTron and BlueBERT, both pretrained on clinical text, using BCCR's pathology reports. For the final decision, we combine the two models' outputs using an OR approach: a report is treated as reportable if either model classifies it as reportable. The fine-tuning data set consisted of 40,000 reports from diagnosis year 2021, and the test data sets consisted of 10,000 reports from diagnosis year 2021, 20,000 reports from diagnosis year 2022, and 400 reports from diagnosis year 2023. Retrospective evaluation showed that the proposed approach improved accuracy in identifying reportable tumors while maintaining the true reportable threshold of 98%. Rule-based NLP in cancer surveillance has two main disadvantages: the manual effort required to design rules and sensitivity to changes in language. PBCRs must distinguish the reportability status of incoming electronic cancer pathology reports, and for this classification task deep learning approaches demonstrate superior performance and provide significant advantages over rule-based NLP.
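The OR combination described above can be illustrated with a minimal sketch. This is not the BCCR implementation: the function names `or_ensemble` and `recall`, the per-model probability inputs, and the 0.5 decision thresholds are all assumptions made for illustration, since the abstract does not state how each model's output is thresholded.

```python
from typing import List, Sequence


def or_ensemble(
    probs_a: Sequence[float],
    probs_b: Sequence[float],
    threshold_a: float = 0.5,  # assumed cut-off, not taken from the study
    threshold_b: float = 0.5,  # assumed cut-off, not taken from the study
) -> List[bool]:
    """Flag a report as reportable if EITHER model flags it.

    probs_a / probs_b are hypothetical per-report probabilities of the
    "reportable" class from two independently fine-tuned classifiers.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same set of reports")
    return [
        (pa >= threshold_a) or (pb >= threshold_b)
        for pa, pb in zip(probs_a, probs_b)
    ]


def recall(predicted: Sequence[bool], truth: Sequence[bool]) -> float:
    """Fraction of truly reportable reports that were flagged."""
    true_positives = sum(1 for p, t in zip(predicted, truth) if p and t)
    positives = sum(1 for t in truth if t)
    return true_positives / positives if positives else float("nan")


# Toy scores for three reports (made-up numbers, for illustration only).
model_a_scores = [0.9, 0.2, 0.1]
model_b_scores = [0.3, 0.8, 0.4]
combined = or_ensemble(model_a_scores, model_b_scores)
print(combined)
```

An OR rule can only add positive predictions relative to either model alone, so it never lowers recall on reportable cases, at the possible cost of more false positives. That trade-off fits a setting where missing a reportable tumor is costlier than reviewing an extra report.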