Even by the journal’s own standards, this was a wild claim. In July 2008, Wired magazine announced on its cover nothing less than “The End of Science”. It explained that “The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data”. Such claims about the emergence of a new “data-driven” science in response to a “data deluge” have now become common, from the pages of The Economist to those of Nature. Proponents of “data-driven” and “hypothesis-driven” science argue over the best methods to turn massive amounts of data into knowledge. Instead of jumping into the fray, I would like to historicize some of the questions and problems raised by data-driven science, taking as a point of departure the three rich papers by Isabelle Charmantier and Staffan Muller-Wille on Linnaeus’ information processing strategies, Sabina Leonelli and Rachel Ankeny on model organism databases, and Peter Keating and Alberto Cambrosio on microarray data in clinical research. That a historical approach is warranted is made clear by the remark of the great book historian Robert Darnton that “every age was an age of information, each in its own way” (Darnton, 2000, p. 1). In particular, perceptions of an “information overload” (or a “data deluge”) have emerged repeatedly from the Renaissance through the early modern and modern periods, and each time specific technologies were invented to deal with the perceived overload (Ogilvie, 2003; Rosenberg, 2003). This commentary will explore the similarities and differences between past and present data-driven life sciences, from early modern natural history to current post-genomics. Renaissance naturalists were no less inundated with new information than our contemporaries. The expansion of travel, epitomized by the discovery of the New World, exposed European naturalists to new facts that did not fit into the systems of knowledge inherited from the Greeks and Romans.
This prompted those interested in understanding the natural world to devise new methods for managing this data, such as note-taking strategies, and new systems of classification (Blair, 2010; Ogilvie, 2006). Ironically, as Charmantier and Muller-Wille point out, these methods and systems, which were meant to tame the information overload, made it possible to accumulate even more data. But accumulation was usually only a means to an end. These early naturalists established collections, which included specimens, drawings, and texts, so that they could compare these items systematically and draw from the comparisons conclusions about the natural world. In general, they were not testing specific hypotheses, but trying to bring order to the bewildering diversity of natural forms by examining large amounts of collected “data”. This tradition continues to be central in natural history to the present day. As George Gaylord Simpson, the leading American paleontologist of the twentieth century, made clear in 1961, natural history, and taxonomy in particular, was the “science that is most explicitly and exclusively devoted to the ordering of complex data” (Simpson, 1961, p. 5). What is striking about Simpson’s definition is not only that he chose the “ordering of complex data” as the most essential element of natural history, but also how similar his definition is to current characterizations of the supposedly unprecedented data-driven sciences. This should come as no surprise since, for several centuries, the natural historical sciences have fundamentally been data-driven sciences. But was natural history driven by data alone? Most likely not, because natural history has never been free of ontological assumptions. For example, most naturalists assume the existence of natural groups.
As Charmantier and Muller-Wille show, Linnaeus, who struggled with a data deluge of his own creation and devised numerous note-taking methods to deal with it, could only do so because he began with a hypothesis about the genus categories he used to organize his data. In other words, Linnaeus may have been driven by his data, but his approach was not exclusively data-driven. This conclusion, however, is insufficient to distinguish early modern approaches to data from contemporary ones. Indeed, as Keating and Cambrosio show in their paper, modern-day biostatisticians analyzing cancer microarray data were equally driven by various hypotheses. For example, the determination of the sample size needed to produce statistically significant results required researchers to make a hypothesis about the number of classes that the data might reveal. In other words, they too were guided by