Identifying Data Sharing in Biomedical Literature

Heather Piwowar, Wendy Chapman

doi:10.1038/npre.2008.1721.2

Abstract

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared through so many mechanisms and in so many locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system identified 61% of articles with shared datasets at 80% precision. A simpler version of our classifier achieved higher recall (86%) but lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Highlights

Reusing primary research data has many benefits for the progress of science
We propose an alternative approach: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles
We developed a pilot NLP application to identify references to data sharing in the biomedical literature and compared its predictive performance against a reference standard of bibliographic citations associated with dataset submissions

Summary

Introduction and Motivation

Reusing primary research data has many benefits for the progress of science. For example, new studies advance more quickly and inexpensively when duplicate data collection is reduced, rare conditions can often be explored only through combining several datasets, and new computational methods can be evaluated through re-analysis. Previous assessments of data sharing have included manual curation, investigator self-reporting, and the analysis of citations within database submission entries. These methods are only able to identify instances of data sharing and data withholding in a limited number of cases and contexts.

Method

We developed a pilot NLP application to identify references to data sharing in the biomedical literature and compared its predictive performance against a reference standard of bibliographic citations associated with dataset submissions. We implemented two approaches for classifying articles as either containing or not containing text indicating a database submission: a set of regular expression patterns to identify relevant lexical cues and a machine learning approach. We applied a variety of machine learning algorithms (trees, rules, Naïve Bayes, and support vector machines) and found similar performance; we report the results with J48 trees since it had the best performance and trees are transparent, portable, and easy to implement.
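The regular-expression approach can be illustrated with a minimal sketch. The cue patterns below are hypothetical, chosen only to show the idea of matching lexical cues; the patterns the authors actually used are not listed here, and the function name `mentions_data_sharing` is likewise an assumption.

```python
import re

# Hypothetical lexical cues for a dataset-sharing declaration.
# These are illustrative guesses, NOT the authors' actual patterns.
CUE_PATTERNS = [
    # "deposited ... database", "submitted ... repository", etc.
    re.compile(
        r"\b(?:deposited|submitted)\b.{0,80}\b(?:database|repository|archive)\b",
        re.IGNORECASE | re.DOTALL,
    ),
    # "accession number", "accession no.", "accession codes"
    re.compile(r"\baccession\s+(?:number|no\.?|code)s?\b", re.IGNORECASE),
]


def mentions_data_sharing(full_text):
    """Return True if any cue pattern occurs in the article text."""
    return any(pattern.search(full_text) for pattern in CUE_PATTERNS)


# Made-up sentences for demonstration only.
sharing = "The microarray data were deposited in a public database."
no_sharing = "We thank our colleagues for helpful discussions."

print(mentions_data_sharing(sharing))
print(mentions_data_sharing(no_sharing))
```

A rule set like this is easy to inspect and extend, but broad cues would likely flag articles that only mention a database in passing, which is one plausible reason to combine such cues with a learned classifier.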

Evaluation Method
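Precision and recall, the two figures quoted in the abstract, compare a classifier's output with a reference standard. The following is a minimal sketch of that comparison, assuming each article is represented by a hypothetical string ID; the IDs and counts are invented for illustration and are not the study's data.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of article IDs.

    predicted: articles the classifier flagged as sharing data
    reference: articles the reference standard lists as sharing data
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: five articles truly share data, and the
# classifier flags four articles, three of them correctly.
reference_standard = {"a1", "a2", "a3", "a4", "a5"}
flagged_by_classifier = {"a1", "a2", "a3", "b1"}

precision, recall = precision_recall(flagged_by_classifier, reference_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.75 recall=0.60"
```

Flagging more articles tends to raise recall at the cost of precision, which matches the trade-off between the two classifier versions described in the abstract.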

Results

Discussion

Journal: Nature Precedings
Publication Date: Aug 4, 2008
Citations: 25
License type: CC BY 3.0
