TODWEB

Umara Noor, Azhar Rauf, Zahid Rashid

doi:10.1145/2095536.2095569

Abstract

Today, the deep web comprises a large part of all web content. Because of this large volume of data, technologies related to the deep web have gained increasing attention in recent years. The deep web consists mostly of online domain-specific databases, which are accessed through web query interfaces. These highly relevant domain-specific databases are more suitable for satisfying users' information needs. To make the extraction of relevant information easier, deep web databases need to be classified into subject-specific, self-descriptive categories. In this paper we present TODWEB, a novel training-less classification approach based on common-sense world knowledge (in the form of an ontology or any external lexical resource) for automatic deep web source classification, which will help in building highly scalable, domain-focused and efficient semantic information retrieval systems (i.e. metasearch engines and search engine directories). An important aspect of this approach is that the classification method is completely training-less and uses the Wikipedia category network and domain-independent ontologies to analyze the semantics in the meta-information of deep web sources. The large number of fine-grained Wikipedia categories is employed to analyze semantic relatedness among concepts, and finally the URL of the deep web search source is mapped to the category hierarchy offered by Wikipedia. Experiments conducted on a collection of search sources show that this approach yields a highly accurate and fine-grained classification compared to existing approaches, nearly identical to the results achieved by manual classification.
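
As a rough illustration of the idea of training-less classification over a category hierarchy, the following minimal sketch scores candidate categories for a search source from the terms in its meta-information. It is not the TODWEB algorithm: the toy category graph `PARENTS`, the term lookup `TERM_TO_CATEGORIES`, the `decay` weighting, and the function names are all hypothetical stand-ins chosen for the example, and a real system would draw on the Wikipedia category network and ontologies instead.

```python
from collections import defaultdict

# Hypothetical toy slice of a category network: child -> parent categories.
PARENTS = {
    "Airlines": ["Aviation"],
    "Airports": ["Aviation"],
    "Aviation": ["Transport"],
    "Hotels": ["Hospitality"],
    "Hospitality": ["Travel"],
    "Transport": [],
    "Travel": [],
}

# Hypothetical lookup from a meta-information term to the categories it names.
TERM_TO_CATEGORIES = {
    "flight": ["Airlines"],
    "airline": ["Airlines"],
    "airport": ["Airports"],
    "hotel": ["Hotels"],
}


def score_categories(terms, decay=0.5):
    """Sum, over all terms, a weight that halves with each step up the hierarchy."""
    scores = defaultdict(float)
    for term in terms:
        frontier = {c: 1.0 for c in TERM_TO_CATEGORIES.get(term.lower(), [])}
        best = {}  # best weight reached per category for this term
        while frontier:
            next_frontier = {}
            for cat, weight in frontier.items():
                if best.get(cat, 0.0) >= weight:
                    continue  # already reached with an equal or better weight
                best[cat] = weight
                for parent in PARENTS.get(cat, []):
                    next_frontier[parent] = max(
                        next_frontier.get(parent, 0.0), weight * decay
                    )
            frontier = next_frontier
        for cat, weight in best.items():
            scores[cat] += weight
    return dict(scores)


def classify(terms):
    """Return the highest-scoring category, or None if no term is recognized."""
    scores = score_categories(terms)
    if not scores:
        return None
    # Sort first so that ties are broken alphabetically and deterministically.
    return max(sorted(scores), key=scores.get)
```

No labeled training data is used here; the only inputs are the terms and the external hierarchy, which is the sense in which such a scheme is training-less. Calling `classify(["flight", "airline", "airport"])` on the toy data selects the most specific category supported by most of the terms.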

Full Text