Abstract

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared through so many different mechanisms and in so many different locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system identified 61% of articles with shared datasets (recall) at 80% precision. A simpler version of our classifier achieved higher recall (86%) but lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Highlights

  • Reusing primary research data has many benefits for the progress of science

  • We propose an alternative approach: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles

  • We developed a pilot NLP application to identify references to data sharing in the biomedical literature and compared its predictive performance against a reference standard of bibliographic citations associated with dataset submissions

Summary

Introduction and Motivation

Reusing primary research data has many benefits for the progress of science. For example, new studies advance more quickly and inexpensively when duplicate data collection is reduced, rare conditions can often be explored only by combining several datasets, and new computational methods can be evaluated through re-analysis. Previous assessments of data sharing have included manual curation, investigator self-reporting, and the analysis of citations within database submission entries. These methods can identify instances of data sharing and data withholding only in a limited number of cases and contexts.

Method

We developed a pilot NLP application to identify references to data sharing in the biomedical literature and compared its predictive performance against a reference standard of bibliographic citations associated with dataset submissions. We implemented two approaches for classifying articles as either containing or not containing text indicating a database submission: a set of regular expression patterns to identify relevant lexical cues, and a machine learning approach. We applied a variety of machine learning algorithms (trees, rules, Naïve Bayes, and support vector machines) and found similar performance across them; we report the results with J48 trees because they had the best performance and because trees are transparent, portable, and easy to implement.
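As a rough illustration of the regular-expression approach and of how precision and recall are computed against a reference standard, the following minimal Python sketch may help. It is not the authors' system: the cue patterns, the toy sentences, and the function names mentions_data_sharing and precision_recall are all assumptions made for this example, and the paper's actual patterns are not reproduced here.

```python
import re

# Hypothetical lexical cues, invented for illustration only.
CUE_PATTERNS = [
    re.compile(r"\b(?:deposited|submitted)\s+(?:in|to|into|at)\b", re.IGNORECASE),
    re.compile(r"\baccession\s+(?:number|no\.?|code)s?\b", re.IGNORECASE),
    re.compile(r"\bGSE\d+\b"),  # example of a repository-style identifier
]


def mentions_data_sharing(text):
    """Return True if any cue pattern matches the article text."""
    return any(pattern.search(text) for pattern in CUE_PATTERNS)


def precision_recall(predicted, actual):
    """Compute (precision, recall) for parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # flagged, but no shared data
    fn = sum(1 for p, a in pairs if not p and a)    # shared data, but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy "articles" paired with a made-up reference standard (True = data shared).
ARTICLES = [
    ("The microarray data were deposited in a public database "
     "under accession number GSE1234.", True),
    ("Raw data can be downloaded from the repository listed in the supplement.", True),
    ("We analyzed samples previously submitted to the ethics committee "
     "for approval.", False),
    ("Patients were recruited from three hospitals.", False),
]

predictions = [mentions_data_sharing(text) for text, _ in ARTICLES]
reference = [label for _, label in ARTICLES]
precision, recall = precision_recall(predictions, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The toy data are chosen to contain one missed article and one false alarm, which shows the trade-off the abstract describes: broader cues tend to raise recall while lowering precision.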
