Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

Amalie Trewartha, Nicholas Walker, Haoyan Huo, Sanghoon Lee, Kevin Cruse, John Dagdelen, Alexander Dunn, Kristin A Persson, Gerbrand Ceder, Anubhav Jain

doi:10.1016/j.patter.2022.100488

Open Access

https://doi.org/10.1016/j.patter.2022.100488

Abstract

The growing volume of publications has created a bottleneck in efficiently connecting new materials discoveries to established literature. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) model and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1% to 12%, implying that domain-specific pre-training provides measurable advantages. Despite its relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, the MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.
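To make "structured data from unstructured text" concrete, the sketch below shows one common post-processing step for NER output: collapsing per-token tags into entity records. It is illustrative only. The abstract does not state which tagging scheme the models use, so the BIO scheme, the function name `bio_to_entities`, the labels `MAT` and `DSC`, and the example sentence are all assumptions, and the tags are written by hand rather than predicted by any of the four models.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not match) closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens = []
            current_label = None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical sentence with hand-written tags (not model output).
tokens = ["LiFePO4", "thin", "films", "were", "annealed", "at", "600", "°C"]
tags = ["B-MAT", "B-DSC", "I-DSC", "O", "O", "O", "O", "O"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

In a full pipeline the `tags` list would come from a trained tagger such as one of the models compared here; the grouping step itself is independent of which model produced the tags.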

Journal: Patterns	Publication Date: Apr 1, 2022
Citations: 70	License type: cc-by
