Abstract

Objective: We aim to build an accurate machine learning-based system for classifying tumor attributes from cancer pathology reports when only a small amount of annotated data is available, motivated by the expense and time required to annotate pathology reports. We use an enriched labeling scheme that records the location of the relevant information along with the final label, together with a hierarchical classification method that leverages these enriched annotations.

Materials and methods: Our data consist of 250 colon cancer and 250 kidney cancer pathology reports from 2002 to 2019 at the University of California, San Francisco. For each report, we classify attributes such as procedure performed, tumor grade, and tumor site. For each attribute and document, an annotator trained by an oncologist labeled both the value of that attribute and the specific lines in the document that indicated the value. We develop a model that uses these enriched annotations by first predicting the relevant lines of the document and then predicting the final value given the predicted lines. We compare our model to multiple state-of-the-art methods for classifying tumor attributes from pathology reports.

Results: Across colon and kidney cancers and varying training set sizes, our hierarchical method consistently outperforms state-of-the-art methods. Furthermore, performance comparable to these methods can be achieved with approximately half the amount of labeled data.

Conclusion: Document annotations enriched with location information greatly increase the sample efficiency of machine learning methods for classifying attributes of pathology reports.
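
The two-stage idea in the methods (predict the relevant lines first, then predict the attribute value from those lines) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the tiny naive Bayes classifier is a stand-in and not the model used in the paper, the report snippets and the "partial"/"radical" labels are invented toy data, the value model is trained on the annotated lines, and `NaiveBayes`, `train_hierarchical`, and `predict_hierarchical` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokens; a real system would tokenize more carefully.
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def train_hierarchical(reports):
    """reports: list of (lines, relevant_line_indices, attribute_value)."""
    line_texts, line_labels, value_texts, value_labels = [], [], [], []
    for lines, relevant, value in reports:
        for i, line in enumerate(lines):
            line_texts.append(line)
            line_labels.append(i in relevant)  # enriched annotation: is this line relevant?
        # Assumption: the value model is trained on the annotated (gold) lines.
        value_texts.append(" ".join(lines[i] for i in sorted(relevant)))
        value_labels.append(value)
    line_model = NaiveBayes().fit(line_texts, line_labels)
    value_model = NaiveBayes().fit(value_texts, value_labels)
    return line_model, value_model


def predict_hierarchical(line_model, value_model, lines):
    # Stage 1: keep the lines predicted to be relevant.
    picked = [line for line in lines if line_model.predict(line)]
    if not picked:  # fall back to the whole report if nothing is selected
        picked = lines
    # Stage 2: classify the attribute value from the selected lines only.
    return value_model.predict(" ".join(picked))


# Invented toy reports for a hypothetical "procedure" attribute.
toy_reports = [
    (["patient seen in clinic", "procedure: partial nephrectomy", "tumor grade 2"], {1}, "partial"),
    (["specimen received in formalin", "procedure: radical nephrectomy", "margins negative"], {1}, "radical"),
    (["clinical history of hematuria", "procedure: partial nephrectomy", "tumor site left kidney"], {1}, "partial"),
    (["gross description follows", "procedure: radical nephrectomy", "tumor grade 3"], {1}, "radical"),
]
line_model, value_model = train_hierarchical(toy_reports)
new_report = ["history of hypertension", "procedure: radical nephrectomy", "tumor grade 2"]
prediction = predict_hierarchical(line_model, value_model, new_report)
```

Restricting the second stage to the selected lines is what lets the line-level annotations help: the value classifier sees only the short span that determines the label rather than the full report.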

