A DOM-based Anchor-Hop-T Method for Web Application Information Extraction

Yuanyuan Zhang, Qinyan Zhang, Guanfu Jiang

doi:10.4304/jsw.9.3.641-647

Abstract

To fuse information about electronic products, the widely adopted approach is to extract it from the HTML structure of business websites and then process the data in depth. Modeling a Web application is difficult, however, because the data in HTML is semi-structured and appears as a DOM (Document Object Model) tree when it is analyzed with an XML schema. The first question to study is therefore how to understand and extract this information. The general Anchor-Hop model, which considers the text property and the label property, handles the problem simply, but its effectiveness is low. The model is sensitive to the HTML structure: if the website structure changes even slightly, extraction accuracy suffers, and the extraction rules must be redefined for the changed structure. To improve extraction efficiency, this paper proposes Anchor-Hop-T, a DOM-based dynamic information extraction model. HTML tags such as table, ol and ul can be searched and processed with XPath, which makes it convenient to extract the corresponding Anchor data block. Furthermore, the location of the Hop point is treated as invariant, and on this basis the new model, built on the Anchor and the Hop point, introduces further concepts for extracting information, such as the Anchor data block, the Anchor locating library and the AH relevance value. Finally, an experiment demonstrates the applicability of the approach.
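The XPath search step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' Anchor-Hop-T implementation: it only locates table, ol and ul elements and collects their text as candidate data blocks. The page fragment, the names `PAGE`, `BLOCK_TAGS` and `find_candidate_blocks` are hypothetical, and the Anchor locating library and AH relevance value are not modeled because the abstract does not define them. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; real-world HTML would likely need a tolerant HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page fragment (not taken from the paper).
PAGE = """
<html><body>
  <div id="spec">
    <table>
      <tr><td>Brand</td><td>Acme</td></tr>
      <tr><td>Screen</td><td>15 inch</td></tr>
    </table>
    <ul><li>Wi-Fi</li><li>Bluetooth</li></ul>
  </div>
  <p>Unrelated text</p>
</body></html>
"""

# Tags named in the abstract as searchable with XPath.
BLOCK_TAGS = ("table", "ol", "ul")


def find_candidate_blocks(markup):
    """Return (tag, text_items) pairs for every table/ol/ul element."""
    root = ET.fromstring(markup)
    blocks = []
    for tag in BLOCK_TAGS:
        # ".//tag" selects all descendants with that tag name; it is part
        # of the XPath subset that ElementTree understands.
        for element in root.findall(f".//{tag}"):
            # Collect the non-empty text fragments inside the element.
            items = [t.strip() for t in element.itertext() if t.strip()]
            blocks.append((tag, items))
    return blocks


blocks = find_candidate_blocks(PAGE)
for tag, items in blocks:
    print(tag, items)
```

Results are grouped by tag in the order of `BLOCK_TAGS`, not in document order; a fuller treatment would also have to decide which of the candidate blocks corresponds to a given Anchor.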

More From: Journal of Software
