Parallel Approach and Platform for Large-Scale Web Data Extraction

Shen Yi, Shengsheng Shi, Chunfeng Yuan, Yihua Huang, Wu Wei, Haitao Wang

doi:10.1109/cbd.2013.24

https://doi.org/10.1109/cbd.2013.24

Abstract

As the most popular information publishing platform, the Web contains a great deal of valuable information of interest to users and applications. Although many data extraction techniques have been studied over the last decade, they still fall far short of the needs of real-world data extraction. On the one hand, most of them cannot support the whole web information extraction process, which involves three stages: web page navigation, data extraction, and data integration. On the other hand, they cannot support parallel data extraction over large-scale collections of web pages. In this paper, we propose a parallel approach and platform based on Hadoop MapReduce for large-scale web data extraction. Our approach can perform the whole three-stage web data extraction process in parallel. Experimental results show that our approach is efficient and can achieve linear speedup.

Full Text