Abstract

Today, deep web comprises of a large part of web contents. Because of this large volume of data, the technologies related to deep web have gained larger attention in recent years. Deep web mostly comprises of online domain specific databases, which are accessed by using web query interfaces. These highly relevant domain specific databases are more suitable for satisfying the information needs of the users. In order to make the extraction of relevant information easier, there is a need to classify the deep web databases into subject-specific self-descriptive categories. In this paper we present a novel training-less classification approach TODWEB based on common sense world knowledge (in the form of ontology or any external lexical resource) for the automatic deep web source classification; which will help in building highly scalable, domain focused and efficient semantic information retrieval systems (i.e. metasearch engine and search engine directories). One of the important aspects of this approach is the classification method which is completely training less and uses Wikipedia category network and domain-independent ontologies to analyze the semantics in the meta-information of the deep web sources. The large number of fine grained Wikipedia categories are employed to analyze semantic relatedness among concepts and finally the URL of deep web search source is mapped to the category hierarchy offered by Wikipedia. The experiments conducted on a collection of search sources shows that this approach results in a highly accurate and fine grained classification as compared to existing approaches, nearly identical to the results achieved by manual classification.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.