DCADE: divide and conquer alignment with dynamic encoding for full page data extraction

Oviliani Yenty Yuliana, Chia-Hui Chang

doi:10.1007/s10489-019-01499-0

Abstract

In this paper, we consider the problem of full schema induction from either multiple list pages or singleton pages that share the same template. Existing approaches do not work well for this problem because they use fixed abstraction schemes: these schemes are suitable for data-rich detection but not appropriate for the small records and complex data found in other sections of a page. We propose an unsupervised full-schema web data extraction method, Divide-and-Conquer Alignment with Dynamic Encoding (DCADE). We define the Content Equivalence Class (CEC) and the Typeset Equivalence Class (TEC) based on leaf node content. We then combine HTML attributes (i.e., id and class) in the paths to form encodings at various levels, so that the proposed algorithm can align leaf nodes by exploring patterns from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. DCADE achieved an F1 measure of 0.962 for non-recordset data extraction (denoted FD) and 0.936 for recordset data extraction (denoted FS), outperforming other page-level web data extraction methods: DCA (FD = 0.660), TEX (FD = 0.454, FS = 0.549), RoadRunner (FD = 0.396, FS = 0.330), and UWIDE (FD = 0.260, FS = 0.081).
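
To make the idea of encoding leaf-node paths at several levels more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the three levels shown (tag with id and class, tag with class only, tag only), the helper names `LeafCollector`, `encode`, and `leaf_encodings`, and the `#`/`.` notation are all assumptions chosen for illustration. The abstract states only that id and class attributes are combined in the paths at various levels.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect each non-empty text node with the element path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, id, class) tuples
        self.leaves = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        a = dict(attrs)
        self.stack.append((tag, a.get("id") or "", a.get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates mildly malformed HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.stack), text))


def encode(path, level):
    """Encode a path at a hypothetical level (assumed, not taken from the paper):
    0 = tag + id + class (most specific), 1 = tag + class, 2 = tag only."""
    parts = []
    for tag, id_, cls in path:
        step = tag
        if level == 0 and id_:
            step += "#" + id_
        if level <= 1 and cls:
            step += "." + cls.replace(" ", ".")
        parts.append(step)
    return "/".join(parts)


def leaf_encodings(html, level):
    """Return (encoded_path, text) for every text leaf in the given HTML."""
    collector = LeafCollector()
    collector.feed(html)
    collector.close()
    return [(encode(path, level), text) for path, text in collector.leaves]
```

Under these assumptions, two pages whose containers carry different id values produce different encodings at level 0 but identical encodings at level 1, so a matcher that falls back from specific to general encodings could still pair the two leaves. How DCADE actually selects levels and forms CEC and TEC is described in the paper itself, not here.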

Journal: Applied Intelligence
Publication Date: Jul 22, 2019
Citations: 3

Similar Papers

Parallel Approach and Platform for Large-Scale WEB Data Extraction
Shen Yi ... Yihua Huang
01 Dec 2013

NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model
Shengsheng Shi ... Wu Wei
01 Jan 2013

Trends in web data extraction using machine learning
Sudhir Kumar Patnaik ... C Narendra Babu
Web Intelligence | VOL. 19
16 Dec 2021

A novel alignment algorithm for effective web data extraction from singleton-item pages
Oviliani Yenty Yuliana ... Chia-Hui Chang
Applied Intelligence | VOL. 48
15 Jun 2018
