Abstract

The rapid growth of the internet and related technologies has already had a tremendous impact on scientific publishing. This journal has given attention to open access publishing (Ascoli 2005; Bug 2005; Merkel-Sobotta 2005; Velterop 2005), to reforming the review process (De Schutter 2007; Saper and Maunsell 2009), to the problems with getting authors to share their data (Ascoli 2006; Kennedy 2006; Teeters et al. 2008; Van Horn and Ball 2008), and to how to enhance the use of shared data (Gardner et al. 2008; Kennedy 2010).

But the impact of the internet and data warehousing on science will be much larger, and there is a growing interest in how these technologies can be leveraged to improve the scientific process (Hey et al. 2009). Let’s travel towards the future and imagine that not only are the tools and infrastructure available to share scientific data at any time after it is generated, but that it has also become standard practice for the community to do so. How this can be achieved is not the focus of this editorial; instead I want to speculate on the relationship between scientific papers and data repositories (Bourne 2005, 2010; Cinkosky et al. 1991) in such an environment. It is important for the scientific community to discuss these issues now because, while these technologies are expected to radically improve the scientific process, they will also change the way in which our work is evaluated. I propose that we should distinguish data publishing from paper publishing (Callaghan et al. 2009; Cinkosky et al. 1991) and, once established for specific scientific fields, promote data publishing as the primary outlet for much of the scientific output.

A good metaphor for data publishing is how complete organism genomic sequences are published in high impact journals now (Srivastava et al. 2010; Warren et al. 2010). Such papers really serve two goals: to announce the availability of the genome sequence in GenBank and to describe some scientific conclusions based on the analysis of the genome. The perceived importance of the latter determines whether a high impact journal will accept the paper, and the authors therefore spend a lot of effort hyping this part. But are these two components irrevocably intertwined? Couldn’t one just publish the data, in this case by depositing the complete sequence in a database, and announce this fact through a form of publication? The analysis could then be published separately at a later time, or distributed over different papers. This is not done because at present the publication of the paper in the high impact journal is considered to be the optimal reward for the researchers, both for career advancement and for success in obtaining new grants (Bourne 2005). I call data publication a method whereby the data providers, who may be different from the people who analyze the data, receive credit for their work when they deposit the sequence in the database, and whereby subsequent access to the data is tracked and considered equivalent to paper citation.

There are a number of advantages to considering data publication as a separate process. First, credit assignment becomes more explicitly defined among the authors. Several journals (like Nature, Science, the PLoS series, etc.) have taken steps towards more granular credit assignment by asking authors to explicitly list their individual contributions.
