Effective Web data extraction with standard XML technologies

Jussi Myllymaki

doi:10.1145/371920.372183

Abstract

We discuss the problem of Web data extraction and describe an XML-based methodology whose goal extends far beyond simple screen scraping. An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages, and create a local, identical replica of those databases as a result. What is needed in this process is much more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process needs to deal with roadblocks such as session identifiers, HTML forms, and client-side JavaScript, and with data integration problems such as incompatible datasets and vocabularies, and missing and conflicting data. Proper data extraction also requires a solid data validation and error recovery service to handle data extraction failures, which are unavoidable. In this paper we describe ANDES, a software framework that makes significant advances in solving these problems and provides a platform for building a production-quality Web data extraction process. Key aspects of ANDES are that it uses XML technologies for data extraction, including XHTML and XSLT, and provides access to the deep Web.
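
As a rough illustration of the general idea of extracting records from XHTML with standard XML tooling, the sketch below pulls rows out of a table using Python's standard-library ElementTree and its limited XPath support. It is not ANDES and does not use XSLT; the sample page, the `products` table id, and the `name`/`price` field names are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed XHTML page (an assumption for this sketch;
# real HTML would first need to be converted to XHTML).
XHTML_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="products">
      <tr><td>Widget</td><td>9.99</td></tr>
      <tr><td>Gadget</td><td>19.50</td></tr>
    </table>
  </body>
</html>"""

# XHTML elements live in this namespace, so path queries must be qualified.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_records(xhtml_text):
    """Return one dict per table row with exactly two cells."""
    root = ET.fromstring(xhtml_text)
    records = []
    for row in root.findall(".//x:table[@id='products']/x:tr", NS):
        cells = [(td.text or "").strip() for td in row.findall("x:td", NS)]
        if len(cells) != 2:
            # Skip rows that do not match the expected layout instead of failing.
            continue
        name, price = cells
        records.append({"name": name, "price": float(price)})
    return records


records = extract_records(XHTML_PAGE)
```

Skipping rows that do not match the expected shape is only a stand-in for the validation and error recovery the abstract calls for; a production process would presumably log and report such rows rather than drop them silently.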

Similar Papers

Effective Web data extraction with standard XML technologies
Jussi Myllymaki
Computer Networks | VOL. 39
11 Apr 2002

Parallel Approach and Platform for Large-Scale WEB Data Extraction
Shen Yi ... Chunfeng Yuan
01 Dec 2013

Web data extraction based on structural similarity
Zhao Li ... Aixin Sun
Knowledge and Information Systems | VOL. 8
02 Feb 2005

The Significance of using Data Extraction Methods for an Effective Big Data Mining Process
Manish Sharma ... Richa Gupta
03 Mar 2023
