Keyphrase Identification Using Minimal Labeled Data with Hierarchical Contexts and Transfer Learning.

Rohan Goli, Paul Biondich, Dean Sitting, Adam Wright, Nina Hubig, Xia Jing, Lior Rennert, David Robinson, Yang Gong, Timothy Law, Christian Nohr, Aneesa Weaver, Hua Min, Ronald Gimbel, Arild Faxvaag

doi:10.1101/2023.01.26.23285060

Abstract

Interoperable clinical decision support system (CDSS) rules provide a pathway to interoperability, a well-recognized challenge in health information technology. Building an ontology facilitates the creation of interoperable CDSS rules, which can be achieved by identifying keyphrases (KPs) in the existing literature. Ontology construction has traditionally been a manual effort by human domain experts; recent natural language processing (NLP) techniques, such as KP identification, can serve as an automatic complement to that effort. However, KP identification requires human expertise, consensus, and contextual understanding for data labeling. This paper presents a semi-supervised KP identification framework, built on bidirectional long short-term memory (BiLSTM) encoders and a conditional random field (CRF) decoder (BiLSTM-CRF), that uses minimal human-labeled data and relies on hierarchical attention over the documents (i.e., at the word, sentence, and abstract levels) and domain adaptation. We created synthetic labels for initial training and used human-labeled data for fine-tuning. We also tested different options during NLP preprocessing and machine learning (ML) training to optimize the ML pipeline. Our method outperforms prior neural architectures by combining initial training on synthetic labels, document-level contextual learning, language modeling, and fine-tuning with limited gold-standard labeled data. In our comparisons, the BIO encoding schema performed slightly better than BILOU, and domain adaptation techniques can improve the quality of synthetic labels. In addition, document-level context, a pre-trained language model (LM), and pre-trained word embeddings (WE) all contributed to better model performance on our tasks. Adding 2 to 4 human-labeled documents for every 100 synthetically labeled documents improves model performance without exhausting the human-labeled documents too quickly.
To the best of our knowledge, this is the first functional framework for identifying KPs in the CDSS sub-domain that is trained on limited human-labeled data. It contributes to general natural language processing (NLP) architectures in areas such as clinical NLP, where manual data labeling is challenging and lightweight deep learning models play an important role in real-time KP identification as a complement to human experts' effort.
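
As an illustration of two choices the abstract mentions, the sketch below shows how keyphrase spans could be turned into per-token tags under the BIO scheme and under BILOU (assumed here to be the alternative scheme that BIO was compared against), and how a few human-labeled documents could be mixed into each block of 100 synthetically labeled ones. This is not the authors' code: the function names, the `KP` tag suffix, the span format, and the exact mixing strategy (appending the human-labeled documents after each block) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def encode_tags(num_tokens: int, spans: List[Span], scheme: str = "BIO") -> List[str]:
    """Convert keyphrase spans into per-token tags under BIO or BILOU."""
    if scheme not in ("BIO", "BILOU"):
        raise ValueError(f"unknown scheme: {scheme}")
    tags = ["O"] * num_tokens  # every token starts outside any keyphrase
    for start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(start, end)}")
        if scheme == "BILOU" and end - start == 1:
            tags[start] = "U-KP"  # unit: single-token keyphrase
            continue
        tags[start] = "B-KP"  # beginning of the keyphrase
        for i in range(start + 1, end):
            tags[i] = "I-KP"  # inside the keyphrase
        if scheme == "BILOU":
            tags[end - 1] = "L-KP"  # last token of a multi-token keyphrase
    return tags


def mix_training_docs(synthetic: Sequence, human: Sequence, human_per_100: int = 3) -> list:
    """Append `human_per_100` human-labeled docs after each block of up to 100 synthetic docs."""
    mixed = []
    human_iter = iter(human)
    for i in range(0, len(synthetic), 100):
        mixed.extend(synthetic[i:i + 100])
        for _ in range(human_per_100):
            doc = next(human_iter, None)
            if doc is None:  # human-labeled pool is exhausted
                break
            mixed.append(doc)
    return mixed


if __name__ == "__main__":
    tokens = "clinical decision support system rules improve interoperability".split()
    spans = [(0, 4), (6, 7)]  # hypothetical gold keyphrases
    for scheme in ("BIO", "BILOU"):
        print(scheme, list(zip(tokens, encode_tags(len(tokens), spans, scheme))))
```

The two schemes differ only at keyphrase boundaries: BILOU marks the last token of a multi-token phrase and single-token phrases explicitly, which gives the CRF decoder more label types to learn from the same amount of labeled data.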

Journal: medRxiv : the preprint server for health sciences
Publication Date: Nov 18, 2024
License type: CC BY-NC-ND 4.0
