Rule Learning for Feature Values Extraction from HTML Product Information Sheets

Costin Bădică, Amelia Bădică

doi:10.1007/978-3-540-30504-0_4

Abstract

The Web is now a huge information repository with a rich semantic structure that, however, is primarily addressed to human understanding rather than automated processing by a computer. The problem of collecting product information from the Web and organizing it in a form suitable for automated machine processing is a primary task of software shopping agents and has received a lot of attention in recent years. In this paper we assume that product information is represented as a set of feature-value pairs contained in an HTML product information sheet that is usually formatted using HTML tables. The paper presents a technique for learning rules that extract product information from such sheets. The technique exploits the fact that the Web pages that represent the product information of a certain producer are generated on the fly from the producer's database and therefore exhibit uniform structures. Consequently, once a human user has manually performed the extraction task for a few information items, a general-purpose inductive learner (we have used FOIL in our experiments) can learn extraction rules that are then applied to the current and other product information sheets to extract other items automatically. The input to the learning algorithm is a relational description of the HTML document tree that defines the types of the HTML tree nodes and the relationships between them. The approach is demonstrated with appropriate examples, experimental results, and software tools.

Keywords: Text Element, Training Document, Extraction Task, Background Relation, Document Tree
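To make the idea of a "relational description of the HTML document tree" more concrete, the following is a minimal sketch of how a small product sheet could be turned into relational facts that an inductive learner might consume. It is an illustration only, not the authors' implementation: the predicate names `tag`, `text`, `child`, and `next`, the class name `TreeToFacts`, and the sample sheet are all assumptions, and the sketch expects well-formed HTML with explicit closing tags. It does not invoke FOIL itself.

```python
from html.parser import HTMLParser


class TreeToFacts(HTMLParser):
    """Hypothetical helper: number the nodes of an HTML tree and emit facts.

    Facts are tuples (assumed vocabulary, not taken from the paper):
      ("tag", node, name)     node is an element with the given tag name
      ("text", node, content) node is a non-empty text node
      ("child", parent, node) node is a direct child of parent
      ("next", left, right)   right is the next sibling after left
    """

    def __init__(self):
        super().__init__()
        self.facts = []
        self._stack = []        # ids of currently open elements
        self._next_id = 0
        self._last_child = {}   # parent id -> id of its most recent child

    def _new_node(self, kind, label):
        node = self._next_id
        self._next_id += 1
        self.facts.append((kind, node, label))
        if self._stack:
            parent = self._stack[-1]
            self.facts.append(("child", parent, node))
            prev = self._last_child.get(parent)
            if prev is not None:
                self.facts.append(("next", prev, node))
            self._last_child[parent] = node
        return node

    def handle_starttag(self, tag, attrs):
        self._stack.append(self._new_node("tag", tag))

    def handle_endtag(self, tag):
        # Assumes well-formed input: every start tag has a matching end tag.
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        content = data.strip()
        if content:
            self._new_node("text", content)


# A made-up one-row product sheet: feature name in the first cell,
# feature value in the second.
sheet = "<table><tr><td>Weight</td><td>2.5 kg</td></tr></table>"

parser = TreeToFacts()
parser.feed(sheet)
parser.close()
facts = parser.facts

for fact in facts:
    print(fact)
```

In this encoding, the feature value "2.5 kg" is a text node whose parent cell is the next sibling of the cell containing "Weight". A learned extraction rule could, in principle, be expressed over relations of this kind, with the few manually extracted items serving as positive examples.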
