A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling

Umamageswari Kumaresan, Kalpana Ramanujam

doi:10.4018/ijirr.290830


Open Access


Abstract


The aim of this research is to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, thereby deducing the template used to generate those pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, and then require manual labeling of the extracted data records. The technique discussed in this paper departs from state-of-the-art approaches by determining the informative sections of a web page through repetition of informative content rather than syntactic structure. In the experiments, the system identified the data-rich region with 100% precision for web sites belonging to different domains. The experiments conducted on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.

Highlights

Web scraping involves extracting enormous amounts of data embedded in semi-structured HTML pages
Let n be the number of nodes in the Semantic Feature Tree, let m be the number of child nodes for each n in SFT and m
These approaches have several drawbacks: their dependency on string matching or tree matching makes them computationally expensive; they cannot perform extraction if only a single source page is available; and missing attributes, use of the same template for formatting different attributes, or use of alternate formatting for the same attribute markedly degrade the accuracy of extraction

Summary

INTRODUCTION

Web scraping involves extracting enormous amounts of data embedded in semi-structured HTML pages. Many commercial tools such as Lixto (Baumgartner, Gatterbauer, & Gottlob, 2009), import.io (https://www.import.io/), and Connotate (https://www.connotate.com/) are available for web data extraction, but their usage requires an understanding of the site map and manual selection of extraction targets. Many automatic approaches such as ExAlg (Arasu & Garcia-Molina, 2003), RoadRunner (Crescenzi, Mecca, & Merialdo, 2002), FiVaTech (Kayed & Chang, 2010) and Trinity (Sleiman & Corchuelo, 2014) also exist in the literature.
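The abstract describes locating the data-rich region through repetition of informative content rather than syntactic structure. The sketch below is only a rough illustration of that general idea and is not the authors' algorithm: it assumes that text strings recurring across sibling subtrees (for example field labels such as "Price:") are a crude proxy for informative content, and it picks the node whose children share the most such strings. All names here (`Node`, `TreeBuilder`, `repetition_score`, `find_data_rich_region`, `SAMPLE`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM-like node: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Only close the element that is currently open; ignore mismatches.
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def all_text(node):
    """Collect every text string in the subtree rooted at node."""
    out = list(node.text)
    for child in node.children:
        out.extend(all_text(child))
    return out


def iter_nodes(node):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def repetition_score(node):
    """Count distinct text strings that occur in at least two child subtrees."""
    counts = Counter()
    for child in node.children:
        counts.update(set(all_text(child)))
    return sum(1 for c in counts.values() if c >= 2)


def find_data_rich_region(html):
    """Return the node whose children share the most repeated text strings."""
    builder = TreeBuilder()
    builder.feed(html)
    return max(iter_nodes(builder.root), key=repetition_score)


# Hypothetical page: a navigation bar plus a list of two records.
SAMPLE = """
<html><body>
<div><a>Home</a><a>About</a></div>
<ul>
<li><b>Title:</b> Book A <b>Price:</b> 10</li>
<li><b>Title:</b> Book B <b>Price:</b> 12</li>
</ul>
</body></html>
"""

region = find_data_rich_region(SAMPLE)
print(region.tag, repetition_score(region))
```

On the sample page the labels "Title:" and "Price:" recur in both list items, so the list container scores highest while the navigation bar scores zero. A single page suffices for this heuristic, which is one of the limitations of template-based approaches noted in the highlights; a real system would need a far more careful notion of "informative content" than exact string repetition.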

RELATED WORKS


Findings

CONCLUSION AND FUTURE DIRECTIONS


Journal: International Journal of Information Retrieval Research
Publication Date: Nov 4, 2021
License type: CC BY 3.0

