ODDPub – a Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications

Nico Riedel, Evgeny Bobrov, Miriam Kip

doi:10.5334/dsj-2020-042

Open Access

Abstract

Open research data are increasingly recognized as a quality indicator and an important resource to increase transparency, robustness and collaboration in science. However, no standardized way of reporting Open Data in publications exists, making it difficult to find shared datasets and assess the prevalence of Open Data in an automated fashion. We developed ODDPub (Open Data Detection in Publications), a text-mining algorithm that screens biomedical publications and detects cases of Open Data. Using English-language original research publications from a single biomedical research institution (n = 8689) and publications randomly selected from PubMed (n = 1500), we iteratively developed a set of derived keyword categories. ODDPub can detect data sharing through field-specific repositories, general-purpose repositories, or the supplement. Additionally, it can detect shared analysis code (Open Code). To validate ODDPub, we manually screened 792 publications randomly selected from PubMed. On this validation dataset, our algorithm detected Open Data publications with a sensitivity of 0.73 and a specificity of 0.97. Open Data was detected for 11.5% (n = 91) of publications. Open Code was detected for 1.4% (n = 11) of publications, with a sensitivity of 0.73 and a specificity of 1.00. We compared our results to the linked datasets found in the databases PubMed and Web of Science. Our algorithm can automatically screen large numbers of publications for Open Data. It can thus be used to assess Open Data sharing rates on the level of subject areas, journals, or institutions. It can also identify individual Open Data publications in a larger publication corpus. ODDPub is published as an R package on GitHub.
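To make the idea of keyword categories and the reported validation metrics more concrete, the following is a minimal Python sketch. It is not the ODDPub implementation (which is an R package) and does not reproduce its keyword lists. The category names, regular expressions, and the functions `detect_sharing` and `sensitivity_specificity` are all assumptions chosen for illustration. The sketch flags a sentence only when an availability term, a data or code term, and a repository or supplement term occur together, and then computes sensitivity and specificity against manual labels.

```python
import re

# Illustrative keyword categories. These are assumptions for this sketch,
# NOT the keyword lists used by ODDPub.
CATEGORIES = {
    "field_specific_repository": r"gene expression omnibus|arrayexpress|protein data bank",
    "general_purpose_repository": r"figshare|dryad|zenodo",
    "supplement": r"supplementa(?:ry|l) (?:data|table|file|material)",
}
AVAILABILITY = r"deposited|available|accessible|uploaded"
DATA_WORDS = r"\bdata(?:sets?)?\b"
CODE_WORDS = r"\b(?:source |analysis )?code\b|\bscripts?\b"
CODE_HOSTS = r"github|gitlab|bitbucket"


def split_sentences(text):
    """Naive sentence splitter; a real pipeline needs something sturdier."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def detect_sharing(text):
    """Return (set of matched Open Data categories, Open Code flag)."""
    found = set()
    open_code = False
    for sentence in split_sentences(text):
        s = sentence.lower()
        has_availability = re.search(AVAILABILITY, s) is not None
        # Open Data: availability term + data term + category term in one sentence.
        if has_availability and re.search(DATA_WORDS, s):
            for name, pattern in CATEGORIES.items():
                if re.search(pattern, s):
                    found.add(name)
        # Open Code: availability term + code term + code host in one sentence.
        if has_availability and re.search(CODE_WORDS, s) and re.search(CODE_HOSTS, s):
            open_code = True
    return found, open_code


def sensitivity_specificity(predicted, actual):
    """Compare algorithm output with manual labels (lists of booleans)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

In this sketch, a statement such as "Data are available from the authors upon request." is deliberately not flagged, because no repository or supplement term accompanies the availability term. Requiring the terms to co-occur within one sentence is one simple way to limit false positives. It is a design choice of this example, not a description of how ODDPub combines its categories.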

Highlights

The explanations hold true for the detection of both Open Data and Open Code
We developed a text-mining algorithm that can screen biomedical publications for mentions of Open Data sharing, as well as the provision of self-developed code (Open Code)

Summary

Introduction

The benefits of openly shared research data (Open Data) for science and society as a whole are manifold, including increases in reproducibility, resource efficiency, economic growth, and public trust in research, as stated e.g. by the Concordat on Open Research Data, drafted by Universities UK, Research Councils UK, HEFCE and Wellcome Trust (The Concordat Working Group 2016). Open Data are increasingly recognized as both an indicator of quality and an important resource that can be reused by other scientists and that can increase transparency, robustness, and collaboration in science (Fecher, Friesike, and Hebing 2015). Several large funders and institutions advocate for Open Data, e.g. the European Commission, NIH, EUA, and RDA Europe (Guedj and Ramjoué 2015; NIH 2003; EUA 2017). Large infrastructure projects are under way, including the European Open Science Cloud (“EOSC Declaration” 2017). Open Code, i.e. analysis code disseminated with the publication, is recognized as a further means of enhancing the transparency of a study, because it makes available the analysis steps leading from data to results. This recognition is increasingly reflected in journal policies (Stodden, Guo, and Ma 2013).

Methods

Results

Conclusion