A BERT model generates diagnostically relevant semantic embeddings from pathology synopses with active learning

Youqing Mu, Clinton J. V. Campbell, Brian Leber, Hamid R. Tizhoosh, Catherine Ross, Monalisa Sur, Rohollah Moosavi Tayebi

doi:10.1038/s43856-021-00008-0


Open Access

https://doi.org/10.1038/s43856-021-00008-0


Abstract

Background: Pathology synopses consist of semi-structured or unstructured text summarizing visual information by observing human tissue. Experts write and interpret these synopses with high domain-specific knowledge to extract tissue semantics and formulate a diagnosis in the context of ancillary testing and clinical information. The limited number of specialists available to interpret pathology synopses restricts the utility of the inherent information. Deep learning offers a tool for information extraction and automatic feature generation from complex datasets.

Methods: Using an active learning approach, we developed a set of semantic labels for bone marrow aspirate pathology synopses. We then trained a transformer-based deep-learning model to map these synopses to one or more semantic labels, and extracted learned embeddings (i.e., meaningful attributes) from the model’s hidden layer.

Results: Here we demonstrate that with a small amount of training data, a transformer-based natural language model can extract embeddings from pathology synopses that capture diagnostically relevant information. On average, these embeddings can be used to generate semantic labels mapping patients to probable diagnostic groups with a micro-average F1 score of 0.779 ± 0.025.

Conclusions: We provide a generalizable deep learning model and approach to unlock the semantic information inherent in pathology synopses toward improved diagnostics, biodiscovery and AI-assisted computational pathology.
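The micro-average F1 score reported above pools true positives, false positives, and false negatives over every (synopsis, label) pair before computing a single F1 value, which suits a task where each synopsis may carry one or more labels. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `micro_f1` and the toy label sets (including the "normal" label) are hypothetical and chosen purely for illustration.

```python
def micro_f1(true_label_sets, predicted_label_sets):
    """Micro-averaged F1 for multi-label predictions.

    Each argument is a list of sets, one set of labels per synopsis.
    Counts are pooled across all synopses and labels before F1 is computed.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_label_sets, predicted_label_sets):
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not annotated
        fn += len(true - pred)   # labels annotated but missed
    denominator = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denominator if denominator else 0.0


# Toy data (hypothetical, not taken from the paper).
y_true = [
    {"acute leukemia"},
    {"acute leukemia", "myelodysplastic syndrome"},
    {"normal"},
]
y_pred = [
    {"acute leukemia"},
    {"acute leukemia"},                      # misses one label
    {"normal", "myelodysplastic syndrome"},  # adds one spurious label
]

score = micro_f1(y_true, y_pred)
print(score)  # 3 TP, 1 FP, 1 FN -> 6 / 8 = 0.75
```

Because counts are pooled, frequent labels influence the micro-average more than rare ones; a macro-average would instead weight every label equally.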

Highlights

Pathology synopses consist of semi-structured or unstructured text summarizing visual information by observing human tissue
Embeddings annotated more complexly with multiple labels tended to fall between major clusters; for example, the embedding labeled with “acute leukemia; myelodysplastic syndrome” fell intermediate between the clusters representing embeddings for “acute leukemia” and “myelodysplastic syndrome”. These synopses represent acute myeloid leukemia (AML) with myelodysplasia-related changes (AML-MRC), which would be conceptually expected by a hematopathologist or hematologist to have features of both semantic labels[48]. These findings suggested both that the semantic labels assigned by hematopathologists were valid, and that the embeddings generated by Bidirectional Encoder Representations from Transformers (BERT) during the development phase with active learning were diagnostically relevant and captured the morphological semantics from pathology synopses
Tools to scalably unlock the semantic knowledge contained within pathology synopses will be essential toward improved diagnostics and biodiscovery in the era of computational pathology and precision medicine[51]
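The highlight above about multi-label embeddings falling between major clusters can be made concrete with a small geometric check. The sketch below is an illustration under stated assumptions, not the authors' analysis: the 2-D vectors are invented toy values standing in for embeddings, and the helper names `centroid` and `position_between` are hypothetical. It projects a point onto the line joining two cluster centroids and reports where along that line it falls.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def position_between(point, start, end):
    """Fractional position of `point` projected onto the segment start -> end.

    0.0 means at `start`, 1.0 means at `end`; values strictly between 0 and 1
    mean the projection lies between the two centroids.
    """
    direction = [e - s for s, e in zip(start, end)]
    offset = [p - s for s, p in zip(start, point)]
    dot = sum(o * d for o, d in zip(offset, direction))
    length_sq = sum(d * d for d in direction)
    return dot / length_sq


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
acute_leukemia = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
myelodysplastic = [[-1.0, 0.1], [-0.9, -0.1], [-1.1, 0.0]]
both_labels = [0.1, 0.05]  # stands in for an "acute leukemia; myelodysplastic syndrome" synopsis

c_al = centroid(acute_leukemia)
c_mds = centroid(myelodysplastic)
t = position_between(both_labels, c_mds, c_al)
print(f"fractional position between centroids: {t:.2f}")
```

A value of `t` near 0.5 would indicate a point roughly midway between the two clusters. Real embeddings are high-dimensional, so a check like this would be applied either in the original space or after a dimensionality reduction, and the result depends on that choice.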

Summary

Introduction

Pathology synopses consist of semi-structured or unstructured text summarizing visual information by observing human tissue. Experts write and interpret these synopses with high domain-specific knowledge to extract tissue semantics and formulate a diagnosis in the context of ancillary testing and clinical information. The limited number of specialists available to interpret pathology synopses restricts the utility of the inherent information. Deep learning offers a tool for information extraction and automatic feature generation from complex datasets

Methods

Results

Conclusion