Abstract
Digital data of all types are being created at an ever-increasing rate, doubling approximately every two years. Annual data creation rates are estimated to reach 44 trillion gigabytes by 2020 [1]. Similarly, the rate at which primary scientific data are being collected is accelerating [2]. This astounding growth in scientific data creation has led to the contemporary discussion of scientific data sharing policies. Many of the criticisms levied against data sharing have focused on practical issues such as the economics and logistics of data storage, technical challenges for doing so, or appropriate attribution of credit [2–9]. In contrast, the arguments in favor of data sharing have focused largely on scientific replication, reproducibility [10], facilitation of collaborative research, and increased citations for publications that share data [11]. This is largely an ethical argument wherein there is an obligation to share data collected using public funds [3–6,12,13]. Rather than focusing on the much-discussed arguments against data sharing—cost, infrastructure, curation, privacy, and attribution/credit concerns—in this Perspective, I outline the overlooked benefits of data sharing: novel remixing and combining as well as bias minimization and meta-analysis. I argue that we must consider the weight of the costs against the true value of the possible benefits. If the decision for any individual researcher, university, or funding agency to implement data sharing policies comes down to a cost—benefit analysis based solely on replication versus storage, the cost—benefit analysis may be artificially tipped in favor of not sharing data caused by overlooking more subtle—but critical—benefits. These hidden benefits of data remixing cannot be appreciated when considering each individual dataset as an independent entity, and thus a richer consideration of those benefits is warranted. Although there is some evidence that, on the local scale, research groups may not make use of shared data [14], in this Perspective, I outline the ways in which research groups are beginning to take advantage of open data in novel, and sometimes surprising, ways. Rather than arguing for a centralized, large-scale data repository, I am advocating for a more organic development wherein we, institutionally, encourage the growth of a data ecosystem. This can be done via multiple venues, such as the general scientific data sharing sites figshare (https://figshare.com/) or the Dryad Digital Repository (http://datadryad.org/), each of which, in addition to Nature Publishing Group’s recently launched peer-reviewed data sharing journal, Scientific Data [15], provides citable Digital Object Identifiers for the data themselves. Such developments are addressing concerns regarding credit and help motivate data curation and contextualization. A data sharing ecosystem provides space for multiple diverse datasets to intermingle to encourage new, multidisciplinary discoveries for current and future scientists.
Highlights
Digital data of all types are being created at an ever-increasing rate, doubling approximately every two years
To give but a few examples of this: studying monkey social behaviors and eating habits led to insights into the origins of HIV [52]; research into how algae move toward light paved the way for optogenetics—using light to control neural activity [53]; and black hole research spurred the development of algorithms eventually used as part of the 802.11 specifications ubiquitously used in modern Wi-Fi [54]
It is our duty to preserve our data so that future generations will not be hindered by our prejudiced interpretations and analytical limitations
Summary
Modern science is creating data at an unprecedented rate, yet most of these data are being discarded. Raw scientific data, when they are published at all, are provided in a very limited form. Multidimensional datasets—rich with hidden information—are reduced to summary statistics filtered through limitations imposed by contemporary methods and technologies, and through the biased lens of the originating research group. The massive loss of raw data currently underway, and the lack of a system for discovering them, hinders scientific progress. In this Perspective, I argue that our contemporary limited view of the long-term scientific and medical benefits that could be made possible by data sharing masks the benefits for doing so.
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.