Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives

Ian Milligan

doi:10.3366/ijhac.2016.0161

Abstract

Contemporary and future historians need to grapple with and confront the challenges posed by web archives. These large collections of material, accessed either through the Internet Archive's Wayback Machine or through other computational methods, represent both a challenge and an opportunity to historians. Through these collections, we have the potential to access the voices of millions of non-elite individuals (recognizing of course the cleavages in both Web access as well as method of access). To put this in perspective, the Old Bailey Online currently describes its monumental holdings of 197,745 trials between 1674 and 1913 as the “largest body of texts detailing the lives of non-elite people ever published.” GeoCities.com, a platform for everyday web publishing in the mid-to-late 1990s and early 2000s, amounted to over thirty-eight million individual webpages. Historians will have access, in some form, to millions of pages: written by everyday people of various classes, genders, ethnicities, and ages. While the Web was not a perfect democracy by any means – it was and is unevenly accessed across each of those categories – this still represents a massive collection of non-elite speech. Yet a figure like thirty-eight million webpages is both a blessing and a curse. We cannot read every website, and must instead rely upon discovery tools to find the information that we need. Yet these tools largely do not exist for web archives, or are in a very early state of development: what will they look like? What information do historians want to access? We cannot simply map over web tools optimized for discovering current information through online searches or metadata analysis. We need to find information that mattered at the time, to diverse and very large communities. Furthermore, web pages cannot be viewed in isolation, outside of the networks that they inhabited. In theory, amongst corpuses of millions of pages, researchers can find whatever they want to confirm. 
The trick is situating it into a larger social and cultural context: is it representative? Unique? In this paper, “Lost in the Infinite Archive,” I explore what the future of digital methods for historians will be when they need to explore web archives. Historical research of periods beginning in the mid-1990s will need to use web archives, and right now we are not ready. This article draws on first-hand research with the Internet Archive and Archive-It web archiving teams. It draws upon three exhaustive datasets: the large Web ARChive (WARC) files that make up Wide Web Scrapes of the Web; the metadata-intensive WAT files that provide networked contextual information; and the lifted-straight-from-the-web guerilla archives generated by groups like Archive Team. Through these case studies, we can see – hands-on – what richness and potentials lie in these new cultural records, and what approaches we may need to adopt. It helps underscore the need to have humanists involved at this early, crucial stage.

Journal: International Journal of Humanities and Arts Computing
Publication Date: Mar 1, 2016
Citations: 38

Similar Papers

Digital humanities and web archives: Possible new paths for combining datasets
Niels Brügger
International Journal of Digital Humanities | VOL. 2
28 May 2021

Web Archive Search as Research: Methodological and Theoretical Implications
Anat Ben-David ... Hugo Huurdeman
Alexandria: The Journal of National and International Library and Information Issues | VOL. 25
01 Aug 2014

To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages
John Berlin ... Mat Kelly
ACM Transactions on the Web | VOL. 17
11 Jul 2023

WARCreate
Mat Kelly ... Michele C Weigle
10 Jun 2012
