The Automatic Detection of Dataset Names in Scientific Articles

Jenny Heddes, Maarten Marx, Miguel Pieters, Pim Meerdink

doi:10.3390/data6080084

Abstract

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Finding that the available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, roughly half of which contain one or more named datasets. A distinguishing feature of this set is the large number of sentences with enumerations, conjunctions and ellipses, which result in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, outperforming several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the open-access articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
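To make the "long BI+ tag sequences" concrete, here is a minimal sketch of BIO encoding. The sentence, the span offsets and the helper `bio_tags` are hypothetical, and the choice to treat a whole conjunction as one mention is an assumption made only for illustration; the paper's annotation guidelines remain the authority on how such cases are labelled.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label tokens with B/I/O given half-open (start, end) token spans."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"               # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"               # every following token of the mention
    return tags


# Hypothetical sentence; the conjunction is assumed to form a single mention.
tokens = "We test on the CIFAR-10 and CIFAR-100 datasets .".split()
spans = [(4, 7)]  # covers "CIFAR-10 and CIFAR-100"
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this assumption, a longer enumeration simply produces a longer run of I tags after the initial B.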

Highlights

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem.
The partial and exact match scores are closest for SciBERT.
An error analysis shows that SciBERT is especially good at learning the beginning of a dataset mention.
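The difference between exact and partial matching can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: here "exact" means identical span boundaries and "partial" means any token overlap with a gold span, and the helpers `extract_spans`, `overlaps` and `match_counts` are hypothetical names.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) token offsets


def extract_spans(tags: List[str]) -> List[Span]:
    """Recover mention spans from a B/I/O tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # close the previous mention
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
        # an "I" tag extends the open mention; a stray "I" is ignored
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def match_counts(gold_tags: List[str], pred_tags: List[str]) -> Tuple[int, int]:
    """Count predicted spans that match a gold span exactly and partially."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    exact = sum(1 for p in pred if p in gold)
    partial = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    return exact, partial


# The prediction finds the right beginning but stops one token early.
gold_tags = ["O", "O", "O", "O", "B", "I", "I", "O", "O"]
pred_tags = ["O", "O", "O", "O", "B", "I", "O", "O", "O"]
exact, partial = match_counts(gold_tags, pred_tags)
print(exact, partial)
```

A tagger whose partial and exact counts are close is one that rarely gets only part of a mention right, which is how the highlight about SciBERT can be read.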

Summary

Related Work

The overwhelming volume of scientific papers has made extracting knowledge from them an unmanageable task [30], making automatic information extraction (IE) especially relevant for this domain [31]. Research on dataset name extraction uses a wide variety of methods across the NER spectrum, including rule-based, BiLSTM-CRF and BERT approaches [3,6,7,8,9,10,11,12,13,14]. [...] computer science papers, with the remaining 82% consisting of papers from the biomedical domain. This model, which was created specifically for knowledge extraction in the scientific domain, achieves better performance than BERT in the computer science domain.

Description of the Data

Origins

Annotation

Train and Test Sets

Experimental Setup

Results

Evaluation

Discussion


Journal: Data	Publication Date: Aug 4, 2021
Citations: 7	License type: CC BY 4.0


Similar Papers

BiodiViz: Leveraging NER and RE for Automated Knowledge Graph Generation in Biodiversity Research
Angela Shannen Tan ... Roselyn Gabud
Biodiversity Information Science and Standards | VOL. 8
29 Oct 2024

ReQue
Mahtab Tamannaee ... Hossein Fani
19 Oct 2020

Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences
Serena Lotreck ... Mohammad Ghassemi
in silico Plants | VOL. 6
11 Nov 2023

A new 2D-3D registration gold-standard dataset for the hip joint based on uncertainty modeling.
Fabio D'Isidoro ... Stephen J Ferguson
Medical Physics | VOL. 48
17 Aug 2021
