Data extraction from web pages based on structural-semantic entropy

Xiaoqing Zheng, Yinsheng Li, Yiling Gu

doi:10.1145/2187980.2187991

Abstract

Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. A notion called structural-semantic entropy, which measures the density of occurrence of relevant information on the DOM tree representation of web pages, is used to locate the data of interest. Our approach is less labor-intensive and insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach. The approach has been successfully applied to detecting false drug advertisements on the web, owing to its capacity for associating the attributes of records with their respective values.
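The abstract does not give the formula for structural-semantic entropy, so the following is only an illustrative sketch of the general idea, not the authors' definition. It assumes that each leaf of a DOM-like tree has already been tagged with a semantic role, and it uses plain Shannon entropy over the roles found beneath a node as a stand-in for "density of relevant information". The `Node` class, the `role` field, the `"unknown"` label, and the function names are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node (not a real DOM API)."""
    tag: str
    role: Optional[str] = None  # assumed semantic label of a leaf, e.g. "price"
    children: List["Node"] = field(default_factory=list)


def leaf_roles(node: Node) -> List[str]:
    """Collect the role labels of all leaves under `node`."""
    if not node.children:
        # Leaves without a recognised role share one "unknown" label.
        return [node.role if node.role is not None else "unknown"]
    roles: List[str] = []
    for child in node.children:
        roles.extend(leaf_roles(child))
    return roles


def role_entropy(node: Node) -> float:
    """Shannon entropy (in bits) of the role distribution under `node`."""
    counts = Counter(leaf_roles(node))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A product block mixing four distinct roles, and a navigation bar with none.
product = Node("div", children=[
    Node("span", role="name"),
    Node("span", role="price"),
    Node("span", role="dosage"),
    Node("span", role="manufacturer"),
])
nav = Node("ul", children=[Node("li"), Node("li"), Node("li")])
page = Node("body", children=[nav, product])

# Pick the child subtree whose leaves carry the most varied roles.
best = max(page.children, key=role_entropy)
```

In this toy page the product block is selected over the navigation bar, because its leaves are spread across several roles while the navigation leaves all fall under the single `"unknown"` label. A real system would also need a way to assign roles to text nodes, which this sketch takes as given.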

Similar Papers

Comparative Mining of B2C Web Sites by Discovering Web Database Schemas
C I Ezeife ... Bindu Peravali
01 Jan 2015

A Web Content Suggestion System for Distance Learning
...
01 Sep 2006

An ontology-based approach for web information extraction
Ismail Jellouli ... Mohammed El Mohajir
01 May 2011

The GCN web page for real-time GRB information: Locations, intensities, fluences and light curves
S D Barthelmy ... P Butterworth
01 Jan 1998
