Dexter

Disheng Qiu, Divesh Srivastava, Xin Luna Dong, Luciano Barbosa, Yanyan Shen

doi:10.14778/2831360.2831372

Abstract

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present Dexter, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection, and handle the high diversity of specifications in terms of content, size and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of the attribute-value pairs from the HTML fragments identified by the specification detector, Dexter adopts two lightweight strategies: a domain-independent and unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns by annotations generated by automatic but noisy annotators. The results show that our crawler strategy to locate product specification pages is effective: (1) it discovered 1.46M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high values of F-measure (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction get very high values of precision (0.92) and recall (0.95) and obtain better results than a state-of-the-art, supervised rule-based wrapper.
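As a rough illustration of the idea that specification fragments share a very similar structure, the sketch below pulls attribute-value pairs out of an HTML table by keeping only rows with exactly two non-empty cells. It is not Dexter's wrapper method, which the abstract does not describe in detail; the class name `TwoColumnTableExtractor`, the function `extract_pairs`, the two-cell heuristic, and the sample table are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TwoColumnTableExtractor(HTMLParser):
    """Hypothetical helper: collects (attribute, value) pairs from
    table rows that contain exactly two non-empty cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._cells = None  # cells of the row being read, None outside a row
        self._text = None   # text chunks of the cell being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            # Join the chunks and normalise runs of whitespace.
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Assumed heuristic: a specification row is "name, value".
            if len(self._cells) == 2 and all(self._cells):
                self.pairs.append(tuple(self._cells))
            self._cells = None


def extract_pairs(html_fragment):
    """Return the list of (attribute, value) pairs found in the fragment."""
    parser = TwoColumnTableExtractor()
    parser.feed(html_fragment)
    parser.close()
    return parser.pairs


# Made-up sample fragment, not taken from the paper's dataset.
SAMPLE = """
<table>
  <tr><th colspan="2">Specifications</th></tr>
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Weight</td><td>1.2 <b>kg</b></td></tr>
</table>
"""

print(extract_pairs(SAMPLE))
```

The single-cell header row is skipped because it does not match the two-cell pattern. A real system would also need to handle lists, nested tables, and noisy markup, which this sketch ignores.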


Journal: Proceedings of the VLDB Endowment	Publication Date: Sep 1, 2015
Citations: 72

