Abstract
As the most popular platform for publishing information, the Web contains a great deal of valuable information of interest to users and applications. Although many data extraction techniques have been studied over the last decade, they still fall far short of the needs of real-world data extraction. On the one hand, most of them cannot support the whole web information extraction process, which involves three stages: web page navigation, data extraction, and data integration. On the other hand, they cannot extract data from large-scale collections of web pages in parallel. In this paper, we propose a parallel approach and platform based on Hadoop MapReduce for large-scale web data extraction. Our approach can perform the whole three-stage web data extraction process in parallel. Experimental results show that our approach is efficient and can achieve linear speedup.