Abstract
In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool for building scientific datasets that facilitate experimentation in web spam research. The application allows the user to specify multiple criteria that change the way in which new corpora are generated, while reducing the number of repetitive and error-prone tasks related to maintaining existing corpora. To this end, WARCProcessor supports up to six data sources commonly used in web spam research and can store the output corpus in the standard WARC format together with complementary metadata files. Additionally, the application supports the automatic and concurrent download of web sites from the Internet, allowing the user to configure the depth of the links to be followed as well as the behaviour when redirected URLs are encountered. WARCProcessor provides both an interactive GUI and a command-line utility for execution in the background.
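The WARC format mentioned above is a simple record-based container. As a rough illustration only (this is a sketch of the standard layout, not WARCProcessor's actual code, and real tools typically rely on dedicated WARC libraries), a minimal "response" record can be written with plain Python:

```python
import io
import uuid
from datetime import datetime, timezone

def write_warc_response(out, target_uri, http_payload):
    """Write one minimal WARC/1.0 'response' record to a binary stream.

    Illustrative sketch of the standard WARC record layout only.
    """
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    out.write(b"WARC/1.0\r\n")
    for name, value in headers:
        out.write(f"{name}: {value}\r\n".encode("utf-8"))
    out.write(b"\r\n")              # blank line ends the WARC header block
    out.write(http_payload)         # the captured HTTP response bytes
    out.write(b"\r\n\r\n")          # record terminator

# Example: store a tiny captured response in memory
buf = io.BytesIO()
payload = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
write_warc_response(buf, "http://example.com/", payload)
```

A corpus file is simply a concatenation of such records (usually gzip-compressed per record), which is what makes WARC convenient as a common output format.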
Highlights
Nowadays, the World Wide Web (WWW) has become an essential source of information in almost every area of knowledge
To foster research in the area, the authors made publicly available several feature sets and source code containing the temporal attributes of eight .uk crawl snapshots, including uk2007, together with the Web Spam Challenge features for the labeled part of the clueweb09 corpus
With the aim of giving specific support to all the singularities that characterize research activities working with this type of information, in this work we present the design, implementation and evaluation of WARCProcessor, a platform-independent integrative tool providing specific support to scientists who need to perform experiments in the field of web spam research
Summary
The World Wide Web (WWW) has become an essential source of information in almost every area of knowledge. To foster research in the area, the authors made publicly available several feature sets and source code containing the temporal attributes of eight .uk crawl snapshots, including uk2007, together with the Web Spam Challenge features for the labeled part of the clueweb corpus. On that occasion, the authors had to process different corpora stored in incompatible formats. Keeping all of the above in mind (i.e., the existing available corpora and specific preprocessing needs), the following key features were identified as essential to implement a powerful yet flexible corpus management tool that helps ensure reproducible research [38] and gives adequate support to the specific requirements of web spam researchers: (i) integration of available information previously classified from different data sources (e.g., blacklists, whitelists, existing corpora, etc.).
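The integration step in (i) can be pictured as merging per-URL labels coming from several sources. The following sketch is purely hypothetical (the function name, label values, and precedence rule are assumptions for illustration, not WARCProcessor's API): explicit blacklist/whitelist entries override labels inherited from an existing corpus, and a URL listed in both lists is flagged for manual resolution.

```python
def merge_labels(blacklist, whitelist, corpus_labels):
    """Combine per-site spam/ham labels from three sources.

    Hypothetical precedence: blacklist/whitelist entries override
    labels inherited from an existing corpus; a URL present in both
    lists is marked as a conflict to be resolved manually.
    """
    merged = dict(corpus_labels)          # start from the existing corpus
    conflicts = set(blacklist) & set(whitelist)
    for url in blacklist - conflicts:
        merged[url] = "spam"
    for url in whitelist - conflicts:
        merged[url] = "ham"
    for url in conflicts:
        merged[url] = "conflict"
    return merged

labels = merge_labels(
    blacklist={"http://spam.example"},
    whitelist={"http://news.example"},
    corpus_labels={"http://old.example": "ham", "http://spam.example": "ham"},
)
# The blacklist entry overrides the stale corpus label for http://spam.example
```

Whatever the concrete rule, making the precedence explicit is what allows the merged corpus to be regenerated reproducibly from its sources.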