Abstract

This editorial introduces the papers in issue 5(2) of Policy & Internet, a special issue that addresses the challenges and opportunities of "big data" for academics and policymakers. Vast amounts of transactional data are collected about us as we go about our daily lives online. Shopping, using mobile phones, taking transport, entertaining ourselves—all leave a data trail that enables companies, and even the state, to track the most mundane aspects of our lives. Add to this all the personal data we willingly share on blogs and social networks like Facebook and Twitter, and academics potentially have access to terabytes of data that can provide new and surprising insights into human behavior and social structure. This "big data" not only offers enormous scope for understanding citizens' willingness—or unwillingness—to engage civically; it also allows social and political scientists to tackle longstanding problems that have hitherto been impossible to address, such as how political movements like the "Arab Spring" and Occupy originate and spread (Bastos, Travitzki, & Raimundo, 2012; Lindgren, 2012). It also holds promise for the design of efficient policy and administrative change.

However, the collection and use of these data also raise a whole range of ethical challenges (Mayer-Schönberger & Cukier, 2013). Obvious issues surround the data protection and privacy of subjects (Schroeder & Meyer, 2012), for example when individual identities become discoverable from supposedly disaggregated and "anonymized" data sets; the trail of digital footprints we leave is certainly long and hard to erase (Mayer-Schönberger, 2009). Big data can also be used for algorithmic and probabilistic policymaking, raising issues of justice and equity. It holds the potential for more coercive modes of governance, whether by introducing conditionality into public policy and services or simply by exerting "nudges." We are also faced with the challenge that traditional databases and desktop computers may be inadequate for the collection, storage, and analysis of larger, more complex social media data sets. More computing power, alternative tools for storage, and new methods for utilizing computational clusters are needed, including computing system architectures suited to constant real-time data collection, and new forms of fast, flexible data storage for retrieval and curation (Walker, Eckert, Hemsley, Mason, & Nahon, 2012). These data are also often the result of "mechanisms of observation, inscription, and representation that serve specific ends—ends, in the case of big data, which are very often commercial" (Barocas, 2012). This raises its own challenges: not only how we obtain access to proprietary or walled data (e.g., from Google or Facebook), but also how we use these data to answer questions that are of interest to us as social scientists, particularly when we have no control over what data are collected or how this is done.

Given the newness and largely unexplored potential of this research and policy landscape, we are very pleased to present this special issue on big data, which gathers together selected papers first presented at the journal's conference Internet, Politics, Policy 2012: Big Data, Big Challenges? Held in Oxford last September, the conference explored the new research and policy frontiers opened up by big data, aiming to encourage discussion across disciplinary boundaries on how to exploit these data to inform policy debates and advance social science research.
What was clear from the discussions was that this area is likely to benefit from genuinely multidisciplinary approaches; that it could lead to the solution of many longstanding research "problems" in the social sciences; and that it is likely to bring about substantial change in the policymaking process. It is difficult to overemphasize the importance of a multidisciplinary approach, particularly as a means of connecting the social sciences with the important—and relevant—work being done, for example, in the physical sciences, computer sciences, and mathematics. As one of our keynote speakers, Duncan Watts, pointed out, what is needed is a "dating service" for engineers (who have the data skills) and social scientists (who have the interesting research questions that these data might answer). This is something the conference set out to do, as indeed does this journal more generally.

Despite all the recent buzz (some might say hype) around "big data," we were initially unsure how much interest there would be in a two-day academic conference on the topic: not only is it a relatively new talking point—and lead times in academia are not short!—but "big data" is also an ontologically tricky concept (Schroeder & Meyer, 2012). However, we plunged on regardless, and were gratified by the genuinely warm reception and interest our call for papers received. Papers were presented across three tracks, with panel sessions not only on traditional areas like political campaigning, legislation and public policy, and government, but also on new research areas that simply would not exist without access to these data: for example, sentiment analysis of entire populations, predictive modeling in economics and politics, and the large-scale information dynamics of the "Arab Spring." Jonathan Bright, winner of one of the conference's "best paper" awards, analyzed the more than 740 million words spoken in the U.K.'s House of Commons since 1936 to uncover the dynamics of parliamentary discourse: how different members are treated, and how the quantity of interventions and the types of topics debated change over time (Bright, 2012). Parliamentary discourse is one of the most important mechanisms through which democracy functions; being able to analyze the complete digital record of 75 years of debate opens a whole new window on political research.

While the term "big data" has been bandied about (particularly in the commercial world) for a while, there is still very little work and discussion on the potential of very large data sets—for example, those covering populations of millions of individuals—for social science research. There was certainly a real sense at the conference that we might be seeing a step-change in our ability to analyze and draw new conclusions from these data: for example, on collective action and dissent (Procter, 2012), information diffusion and influence (Freelon, 2012), prediction of economic trends (Metreveli, 2012), or the use of secondary data to inform policy (Baumann & Eulenstein, 2012; Blake, 2012). But there was also a countervailing view that simply accessing and manipulating large data sets is not sufficient; we also need to develop new theory and new questions (González-Bailón, 2013). Big data does not necessarily mean interesting data. And while social scientists may not generally have access to the brute machine force of a physical or computer science lab, it is important that we are part of the "conversation" as big data-driven research capabilities open up and develop.
Working with academics from other disciplines and academic traditions will form an important part of this. Government, too, is becoming increasingly (if patchily) aware of the potential of these data; after all, it sits on, guards, and processes vast quantities of data about us. Our opening keynote speaker, Nigel Shadbolt (2012), is co-director of the U.K.'s Open Data Institute, which aims to catalyze the evolution of an open data culture to create economic, environmental, and social value. In 2009 he was appointed by Prime Minister Gordon Brown as an Information Advisor to transform access to public sector information, work that led to the highly acclaimed data.gov.uk site, which now provides a portal to over 9,000 government data sets. For government, the potential of more open and "big data"-driven policy includes efficiency savings, the crowdsourcing of knowledge (Stottlemyre & Stottlemyre, 2012), and the harnessing of citizen-based innovation. However, implementation of these data policies could face substantial barriers, including organizational and structural inefficiencies, conflicting or legacy legislation, a lack of implementation instruments and technical expertise, and a simple lack of understanding of the utility of these data among decision makers (Stefaneas, Tsiavos, & Karounos, 2012). Few public administrators, and even fewer elected officials, have access to—or knowledge of—the statistical skills and relevant case examples necessary to fully utilize the information available from burgeoning mega-databases; applying big data sets to public sector decision-making processes may therefore require a merger of private sector methodology, public administrative expertise, and political leadership (Milakovich, 2012). These data will also require the development of new methods and tools: for example, using Web crawlers and archives to map the presence of government over time (Hale, 2012), extracting reliable public opinion indicators and approval ratings from the sentiments encoded in online communications (Lansdall-Welfare, Lampos, & Cristianini, 2012), or simply solving problems of scalable data extraction and automation (Furche, Gottlob, Grasso, & Schallhart, 2012; Sudhahar, Lansdall-Welfare, Flaounas, & Cristianini, 2012).

In this special issue, Jensen and Anstead (2013) analyze the utility of Twitter posts in predicting the outcomes of the 2011 Republican nomination, comparing aggregated polling figures and electoral results with over a million tweets relating to the nomination process. They consider three categories of models. The first is a mentions model that examines the correspondence between the volume of communications about a candidate and electoral outcomes. The second treats Twitter like a prediction market, aggregating predictions of the electoral result. Lastly, they consider whether the rediffusion of tweets about a candidate is a reliable predictor of their performance. While they find inconsistent support for the predictive value of Twitter mentions as an estimate of the overall vote, they find some evidence of otherwise undetected shifts in momentum in the aggregated predictions of candidate performance and in message diffusion via retweets.
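As a purely illustrative sketch of the simplest of these approaches, the snippet below computes the correlation between candidates' shares of Twitter mentions and their vote shares. The candidate labels and all figures are invented, and this is a minimal stand-in for a "mentions model" rather than Jensen and Anstead's actual method.

```python
# Minimal "mentions model" sketch: does a candidate's share of Twitter
# mentions track their share of the vote? All figures are invented for
# illustration; Jensen and Anstead's actual models are more elaborate.
from statistics import correlation  # Python 3.10+

mentions = {"A": 410_000, "B": 280_000, "C": 190_000, "D": 120_000}  # hypothetical mention counts
votes = {"A": 39.0, "B": 26.0, "C": 21.0, "D": 11.0}                 # hypothetical vote shares (%)

total = sum(mentions.values())
mention_share = {c: 100 * n / total for c, n in mentions.items()}

candidates = sorted(mentions)
r = correlation([mention_share[c] for c in candidates],
                [votes[c] for c in candidates])
for c in candidates:
    print(f"candidate {c}: mentions {mention_share[c]:.1f}%, vote {votes[c]:.1f}%")
print(f"Pearson r between mention share and vote share: {r:.2f}")
```

A high correlation in a toy example like this is exactly what makes the approach seductive; Jensen and Anstead's point is that on real data such correspondence proves inconsistent.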
Also in this issue, Aragón, Kappler, Kaltenbrunner, Laniado, and Volkovich (2013) analyze over three million tweets to examine the activity, emotional content, and interactions of political parties and politicians during the 2011 Spanish national elections. They discuss the adaptation of political parties to the new communication and organizational paradigm brought about by online social networks—one still unregulated by election laws—analyzing the reply and retweet networks of seven political parties to assess their information dynamics. They find that political parties, especially the major traditional ones, still use Twitter as a one-way communication tool. Moreover, they find evidence of a balkanization trend in the Spanish online political sphere, as observed in previous studies of other countries. While the use of networked social media to investigate large-scale information dynamics is still in its early days (and still largely concerned with single platforms), it will be interesting to see what work emerges in this area, particularly in terms of online publics, and public opinion monitoring and modeling.

Our next article, by Koltsova and Koltcov (2013), looks at the agenda-setting potential of the top bloggers on Russia's leading blog platform, LiveJournal. The authors use methods from computer science to model the platform's topic structure, finding that LiveJournal's top users share their attention evenly between "social/political" and "private/recreational" issues, the latter being very stable, while the influence of the Russian street protests of 2011 is clearly visible in the blogs' political content. The greatest volatility is shown by the group of topics centering on social issues, which could possibly serve as an online public opinion barometer for proactive policymaking.

This is certainly an interesting area. Chadefaux (2013; also a winner of one of the conference's "best paper" awards) analyzed a comprehensive data set of historical newspaper articles covering 166 countries since 1900, tested against the two hundred conflicts that occurred in that period. Using only information available at the time, he could predict the onset of a war within the next year with up to 85 percent confidence, forecasting over 70 percent of large-scale wars while issuing false alarms in only 16 percent of observations. Also in terms of public "barometers," Lansdall-Welfare et al. (2012) analyzed 484 million tweets to examine the effect of the recession on the U.K.'s collective mood, finding negative shifts at the time of the spending cut announcements and the August 2011 riots. They propose that constant "nowcasting" of certain collective properties of society is possible by monitoring the contents of social media.
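To make the idea of "nowcasting" concrete, here is a minimal sketch, not the authors' actual pipeline, that tracks a daily mood index as the share of negative messages in a stream of timestamped posts; the tiny word lists and sample messages stand in for a real sentiment classifier and a real tweet archive.

```python
# "Nowcasting" sketch: track daily collective mood as the share of negative
# messages in a social media stream. The crude lexicon and invented sample
# posts are placeholders; Lansdall-Welfare et al.'s method is more elaborate.
from collections import defaultdict
from datetime import date

NEGATIVE = {"cuts", "riot", "fear", "angry", "lost"}
POSITIVE = {"happy", "win", "great", "hope", "love"}

def score(text: str) -> int:
    """Crude lexicon score: positive words count +1, negative words -1."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

# (day, text) pairs standing in for a timestamped tweet stream.
stream = [
    (date(2011, 8, 5), "great day, feeling happy"),
    (date(2011, 8, 6), "riot in the city, people angry"),
    (date(2011, 8, 6), "spending cuts announced, fear everywhere"),
    (date(2011, 8, 7), "hope things improve"),
]

neg, total = defaultdict(int), defaultdict(int)
for day, text in stream:
    total[day] += 1
    if score(text) < 0:
        neg[day] += 1

for day in sorted(total):
    print(f"{day}: negative share = {neg[day] / total[day]:.0%} ({total[day]} msgs)")
```

Run against millions of messages per day rather than a handful, a time series like this is what lets shocks such as the spending cut announcements or the August 2011 riots show up as visible shifts.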
Any predictive or monitoring capacity will obviously be of interest to governments and policymakers. However, big data generation and analysis require expertise and skills that can pose a particular challenge to governmental organizations, given their dubious record on the guardianship of large-scale data sets, the management of large technology-based projects, and the capacity to innovate. In 2010, the U.K.'s incoming Coalition Government set up the Behavioural Insights Team (aka "The Nudge Unit") in the Cabinet Office to find innovative and cost-effective (i.e., cheap) ways to change people's behavior: encouraging us to eat more healthily, save money, and stop smoking. However, very few of these experiments have used the manipulation of online information environments as a way of doing so.

There is a definite sense that governments are still feeling their way in the innovative use of online data, and there are questions about whether the right expertise even exists within government departments, compared with, say, academia and business. For example, very interesting work on modeling networked diffusion is being done by our second keynote speaker, Duncan Watts (2012), a principal researcher at Microsoft Research and previously principal research scientist at Yahoo! Research, where he directed the Human Social Dynamics group. He described the diffusion patterns arising in a range of online domains—communications platforms, networked games, and microblogging services—each involving distinct types of content and modes of sharing (Goel, Watts, & Goldstein, 2012). Strikingly similar patterns were found across all domains, ones that do not fit the "spread of infectious disease" analogy commonly applied to a wide range of social and economic adoption processes, including those related to new products, ideas, norms, and behaviors. They find instead that the vast majority of cascades are small and simple, terminating within one degree of an initial adopting "seed." They also find that adoptions resulting from chains of referrals are extremely rare; even for the largest cascades observed, the bulk of adoptions often took place within one degree of a few dominant individuals. Together, these observations—which could only have been obtained using large networked populations—suggest we need to reconsider our assumptions about online adoption processes. (Webcasts of the keynote addresses by Nigel Shadbolt and Duncan Watts are available on the conference website.)
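As a rough illustration of the kind of measurement behind this finding (and not the original study's code), the sketch below takes adoption cascades represented as parent-to-child "who adopted from whom" edges and computes the share of adoptions occurring within one hop of the seed; the toy cascades are invented.

```python
# Sketch: measure cascade structure from (parent, child) adoption edges.
# For each cascade we compute every adopter's depth below the seed and the
# share of adoptions at depth 1 (i.e., taken directly from the seed).
from collections import defaultdict

def depths(edges: list[tuple[str, str]], seed: str) -> dict[str, int]:
    """Breadth-first depth of every node reachable below the seed."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    depth, frontier = {seed: 0}, [seed]
    while frontier:
        nxt = []
        for node in frontier:
            for c in children[node]:
                if c not in depth:
                    depth[c] = depth[node] + 1
                    nxt.append(c)
        frontier = nxt
    return depth

# Two invented cascades: a typical small one and a rarer deeper one.
cascades = {
    "small": ([("s", "a"), ("s", "b")], "s"),
    "deeper": ([("s", "a"), ("a", "b"), ("s", "c"), ("c", "d")], "s"),
}
for name, (edges, seed) in cascades.items():
    d = depths(edges, seed)
    adopters = [n for n in d if n != seed]
    within_one = sum(1 for n in adopters if d[n] == 1) / len(adopters)
    print(f"{name}: size={len(adopters)}, share within one hop of seed={within_one:.0%}")
```

Goel, Watts, and Goldstein's striking result is that, applied at the scale of millions of cascades, statistics like these show most cascades ending at depth one, contrary to the long referral chains the epidemic analogy would predict.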
So there are certainly lots of exciting things that can be (and are being) done, many of which were presented and discussed at the conference. But is "big data" really a universal panacea? Taking one specific case, there is certainly a great deal of enthusiasm about the prospects for the vast quantities of data held in health care systems around the world. Health care appears to offer the ideal combination of circumstances for their exploitation, with a need to improve productivity on the one hand, and the availability of data that can be used to identify opportunities for improvement on the other. In this issue, Keen, Calinescu, Paige, and Rooksby (2013) argue that this enthusiasm rests on two assumptions: first, that the data sets held by hospitals and other organizations, and the technological infrastructure needed for their acquisition, storage, and manipulation, are up to the task; and second, that organizations outside health care systems will be able to access detailed data sets. Using the example of the National Health Service in England, however, they argue that both assumptions can be challenged, and furthermore that the public acceptability of third-party access to detailed health care data sets is, at best, unclear. This is part of a larger issue surrounding the reuse of personal and behavioral data: even if in most cases these data are only intended to be used in anonymized and aggregated form to identify trends, or to improve and personalize services, the fact that their collection is now so routine and so extensive should make us question whether the regulatory system governing data collection, storage, and use is fit for purpose (Andrade, 2012).

We should also be aware of the apparent tension between innovation and economic growth on the one hand, and privacy and data protection regimes on the other, and therefore of the need to regulate the sector in ways that satisfy the varying needs of users while nonetheless allowing service providers to innovate (Porcedda & De Filippi, 2012).

In summary, big data promises powerful and emergent opportunities for academics and policymakers, enabling the generation of new, precise, and rapid insights into economic, social, and political practices and processes. Access to these newly opened-up data will inevitably affect social research and policymaking, and will also demand a re-examination of political science knowledge and theory. In closing, we mention the first article in this issue, "Social Science in the Era of Big Data" by Sandra González-Bailón (2013), which reviews and discusses more fully many of the things touched on briefly in this editorial. She considers the implications of the "data deluge" for social science research, and for the types of questions we can ask of the world we inhabit. She explains why, in spite of all the data, theory still matters in building credible stories of what the data reveal, and shows how this allows us to revisit old questions at the intersection of new technologies and disciplinary approaches. There is a clear and urgent need for academics to understand the potential and challenges of big data for the public policymaking process, including its methodological, technical, theoretical, and ethical challenges and concerns. The article closes with a consideration of the policy implications of big data research, focusing on how it can help us improve communication and governance in a range of policy domains. This is something our conference certainly highlighted and sought to address.

We have very much enjoyed selecting the papers presented in this special issue on big data; we hope you enjoy reading them.

Helen Margetts
Oxford Internet Institute, University of Oxford (helen.margetts@oii.ox.ac.uk)

David Sutcliffe
Oxford Internet Institute, University of Oxford
