ExKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies.

Tiancheng Yang, Ilia Sucholutsky, Matthias Schonlau, Kuang-Yu Jen

doi:10.7717/peerj-cs.1888

Open Access

https://doi.org/10.7717/peerj-cs.1888

Journal: PeerJ Computer Science
Publication Date: Feb 28, 2024
Citations: 3
License type: CC BY 4.0

Affiliation: University of Waterloo

Abstract

Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they usually require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.

The objective of this study was to develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) "What kind of rejection does the patient show?"; (2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?"

Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports totaling 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERT's tokenizer with six technical keywords, which extended the model's vocabulary, and repeating the pre-training procedure. All three models were fine-tuned with information retrieval heads.

The model with extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.

ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on specialized small domains does not necessarily improve performance. Extending the BERT tokenizer's vocabulary is essential for improving performance in specialized domains, especially when pre-training on small corpora.
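To illustrate why extending a tokenizer's vocabulary can matter, the following is a minimal sketch of greedy longest-match-first WordPiece splitting, the general scheme used by BERT-style tokenizers. It is not the authors' code: the toy vocabulary, the helper name `wordpiece_tokenize`, and the example word are all assumptions chosen for illustration, and the example word is not claimed to be one of the six keywords used in the paper.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into WordPiece tokens (greedy, longest match first)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches at this position: the word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary standing in for a general-purpose tokenizer.
base_vocab = {"glo", "##mer", "##ul", "##itis"}

# Illustrative domain term (an assumption, not taken from the paper).
word = "glomerulitis"

# Without the term in the vocabulary, it is fragmented into sub-word pieces.
before = wordpiece_tokenize(word, base_vocab)

# After adding the term as a whole-word entry, it maps to a single token.
extended_vocab = base_vocab | {word}
after = wordpiece_tokenize(word, extended_vocab)

print(before)  # ['glo', '##mer', '##ul', '##itis']
print(after)   # ['glomerulitis']
```

With the extended vocabulary, the domain term gets its own token, and therefore its own embedding that further pre-training can update, rather than being represented only through generic sub-word pieces. In the Hugging Face Transformers library, the comparable steps are usually `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the abstract does not state which tooling the authors used, so that mapping is an assumption.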
