Automatic discovery of Web Query Interfaces using machine learning techniques

Heidy M. Marin-Castro, Ivan Lopez-Arevalo, Victor J. Sosa-Sosa, Jose F. Martinez-Trinidad

doi:10.1007/s10844-012-0217-4

Abstract

The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and is generated dynamically by querying back-end (relational) databases through Web Query Interfaces (WQIs), which are a special type of HTML form. Accessing Deep Web information is a major challenge because it is usually not indexed by general-purpose search engines. It is therefore necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach relevant information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects a suitable set of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, created using a generic crawler supported by human experts, that includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms other previously reported works.
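As a rough illustration of the kind of processing the abstract describes, the following Python sketch counts a few HTML elements inside each form and applies one simple rule to them. The abstract does not list the actual features or heuristic rules used in the paper, so the feature names, the helper functions `extract_form_features` and `looks_searchable`, and the rule itself are all hypothetical and serve only as an example of the general idea.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collects simple element counts for every <form> in a page.

    The chosen features are illustrative assumptions, not the
    feature set used in the paper.
    """

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per completed form
        self._current = None   # features of the form being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "text_inputs": 0,
                "password_inputs": 0,
                "selects": 0,
                "checkboxes_radios": 0,
                "submit_buttons": 0,
            }
        elif self._current is not None:
            if tag == "input":
                kind = (attrs.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text_inputs"] += 1
                elif kind == "password":
                    self._current["password_inputs"] += 1
                elif kind in ("checkbox", "radio"):
                    self._current["checkboxes_radios"] += 1
                elif kind in ("submit", "image"):
                    self._current["submit_buttons"] += 1
            elif tag == "select":
                self._current["selects"] += 1
            elif tag == "button":
                # A <button> without a type attribute submits the form.
                if (attrs.get("type") or "submit").lower() == "submit":
                    self._current["submit_buttons"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list with one feature dict per form found in `html`."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Hypothetical rule: no password field, at least one query field
    (text box or select list) and at least one submit control."""
    if features["password_inputs"] > 0:
        return False
    has_query_field = features["text_inputs"] + features["selects"] > 0
    return has_query_field and features["submit_buttons"] > 0


search_form = (
    '<form action="/search" method="get">'
    '<input type="text" name="q">'
    '<select name="category"><option>Books</option></select>'
    '<input type="submit" value="Search">'
    '</form>'
)
login_form = (
    '<form method="post">'
    '<input type="text" name="user">'
    '<input type="password" name="pw">'
    '<input type="submit" value="Log in">'
    '</form>'
)

for form in extract_form_features(search_form + login_form):
    print(form, looks_searchable(form))
```

In a setup like the one the abstract outlines, feature vectors of this kind would be passed to a trained classifier after prototype selection on the training set; a single hand-written rule such as the one above only stands in for that step.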

Journal: Journal of Intelligent Information Systems
Publication Date: Aug 23, 2012
Citations: 9

Similar Papers

A strategy for identification of Web query interfaces using supervised learning
Heidy M Marin-Castro ... Victor J Sosa-Sosa
01 Oct 2011

WebQuIn-LD: A Method of Integrating Web Query Interfaces Based on Linked Data
Julio Hernandez ... Heidy M Marin-Castro
IEEE Access | VOL. 9
01 Jan 2020

Automatic Identification of Web Query Interfaces
Heidy M Marin-Castro ... Victor J Sosa-Sosa
01 Jan 2010

A hierarchical approach to model web query interfaces for web source integration
Eduard C Dragut ... Thomas Kabisch
Proceedings of the VLDB Endowment | VOL. 2
01 Aug 2009
