Abstract

With its seemingly limitless scope, the World Wide Web promises enormous advantages, along with enormous problems, to researchers who seek to use it as a source of data. Websites change continually, and a high level of flux makes it challenging to capture a snapshot of the web, or even a cross-section of a small subset of the web. A web archive, such as the one at the Internet Archive, promises to store and deliver repeated cross-sections of the entire web, and it also offers the potential for longitudinal analysis. Whether this potential is realized depends on the extent to which the archive has truly captured the web. Therefore, a crucial question for Internet researchers is: ‘How good are the archival data?’ We ask if there are systematic biases in the Internet Archive, using a case study to address this question. Specifically, we are interested in whether biases exist in the British websites stored in the Internet Archive data. We find that the Internet Archive contains a surprisingly small subset, about 24%, of the webpages of the website that we use for our case study (the travel site TripAdvisor). Furthermore, the subset of data we found in the Internet Archive appears to be biased and is not a random sample of the webpages on the site. The archived data we examine have a bias toward prominent webpages. This bias could create serious problems for research using archived websites.

