Abstract

These are derivative files generated by the Web Archives for Longitudinal Knowledge (WALK) project, which ran from 2016 to 2018. WALK was an interdisciplinary project led by scholars at York University, the University of Waterloo, and the University of Alberta. Its goal was to bring together major Canadian web archive holdings and give researchers access to search indexes and derivative files, including plain text, network diagrams, and domain frequency information. These files will be useful to digital humanists who want to work with text at scale or with the hyperlink networks of large parts of the archived Web. Six universities participated: the University of Toronto, University of Alberta, University of Victoria, University of Winnipeg, Dalhousie University, and Simon Fraser University. The files reflect the state of their public web archives from late 2017 to mid-2018. Each xz file contains the derivative files for a given collection: a GraphML file, which you can load with Gephi (no layout or transformations have been applied, so you will need to do this manually); a csv file that describes the distribution of domains within the web archive; and a txt file that contains the plain text extracted from HTML documents within the web archive, with the crawl date, full URL, and plain text of each page. An xz file may also contain a GEXF file, which you can also load with Gephi; it has a basic layout applied by our GraphPass program, allowing you to see major nodes and communities in the network. This project has evolved into the Archives Unleashed Project. Information on Archives Unleashed and the WALK project can be found at https://archivesunleashed.org and on our blog at https://news.archivesunleashed.org.
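
For readers who want a quick look at a hyperlink network before opening it in Gephi, the following is a minimal sketch that counts nodes and edges in a GraphML file and lists the most linked-to nodes. It relies only on the standard GraphML namespace and the Python standard library. The function name `summarize_graphml` and the file name in the usage comment are hypothetical, and the sketch assumes that edges are directed and carry `source` and `target` attributes, as the GraphML format specifies. Node ids in the actual files may be opaque identifiers rather than domain names, and any edge weights are ignored here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML namespace; the prefix "g" is arbitrary.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graphml(source, top_n=5):
    """Return (node_count, edge_count, most linked-to node ids).

    `source` is a file path or a binary file object holding GraphML.
    Each <edge> element is counted once; edge weights are not read.
    """
    root = ET.parse(source).getroot()
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    # In-degree per node id: how many edges point at each target.
    in_degree = Counter(edge.get("target") for edge in edges)
    return len(nodes), len(edges), in_degree.most_common(top_n)


# Hypothetical usage, after extracting a collection's xz file:
# node_count, edge_count, top = summarize_graphml("collection.graphml")
# print(node_count, edge_count, top)
```

The whole file is parsed into memory, which should be fine for modest collections; for very large networks a streaming parser such as `ET.iterparse` may be a better fit.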
