Big Data, Little Data, No Data: Scholarship in the Networked World by Christine L. Borgman

Hallam Stevens

doi:10.1353/tech.2016.0099

Abstract

Big Data, Little Data, No Data: Scholarship in the Networked World. By Christine L. Borgman. Cambridge, MA: MIT Press, 2015. Pp. 400. $32.

In 2012, GigaScience, based at BGI-Hong Kong, became one of the first scientific journals to publish “data” on a large scale. Supported by BGI and the open access publisher BioMed Central, GigaScience encouraged the publication of “Data Notes” that described a dataset stored on BGI’s servers. The aim of such “data publication” was to make data widely and freely available for future use by researchers other than those who produced it. But GigaScience staff knew that this task of “making data available” entailed far more than just dumping a dataset in a database and storing it there for download. Indeed, most of GigaScience’s work consists of building novel methods of making data usable, creating environments that allow third parties to (re-)analyze archived data. Virtual machines, containers, and dockers are made to “wrap around” the data in order to permit replication of results, sharing, and validation (see, for example, http://blogs.biomedcentral.com/gigablog/2015/12/14/wish-gigachristmas-2015-wrap/). Needless to say, none of this comes easy, or cheap. The efforts of GigaScience in biomedicine capture the most critical argument of Christine Borgman’s book: making, sharing, and re-using data takes work. Many of the contemporary pronouncements about the value of data and data-sharing and data openness seem to elide this fact: data is taken to be “natural” and sharing it even more so. Big Data, Little Data, No Data draws on a wide range of examples from the natural sciences, social sciences, and humanities to show how what constitutes data (or, more precisely, what is made to constitute data) varies widely both within and across scholarly domains.
“Data” to an astrophysicist is not “data” to a historian. What ends up counting as data in a particular subfield is the result of a complex interaction of institutions, contingency, interests, instruments, and standards. This basic fact has wide-ranging implications for any policy that attempts to delineate how (or what) data should be captured, stored, or shared. Big Data begins by outlining various attempts to define “data” and some of the varieties of “data scholarship” and the challenges that these new fields raise (including problems of trust, collaboration, standardization, and open access). The middle section of the book elaborates detailed case studies of data generation and practice in different subfields: astronomy and environmental sensor-networks (natural sciences), the Oxford Internet Survey and “socio-technical studies” associated with the Center for Embedded Networked Sensing (social sciences), and digital archaeology and Buddhist studies (humanities). In the final three chapters, Borgman uses these cases to spell out policy implications for data release, sharing, and re-use (chapter 8), credit (chapter 9), and long-term storage of data (chapter 10). One of the corollaries of the multiplicity of “datas” that Borgman describes is that they are never simple and straightforward units that can be easily detached from their contexts: “Research data are complex sociotechnical objects that exist within communities, not simple commodities that can be traded in a public market” (p. 213). In particular, they are tied to software, hardware, instruments, protocols, documentation, and so on. Take these things away (or change them) and one ends up with a very different kind of data (or “no data”). Data sharing is now usually seen as the key to reproducing research results, making public assets available, and realizing return on investments in research.
But the complexity and multiplicity of data suggest that what “reproducibility” means (in and across research domains) is often contested, that few members of the “public” can actually make use of available data (while the private sector may benefit), and that it may be very difficult to leverage data (abstracted from its original context) into useful knowledge. The entanglement of data with the processes of its production means that “re-use” is never straightforward. The current enthusiasm for “big data” and “data openness” takes for granted that we should...
