Abstract

The large and continuously growing body of dynamic web content has created new opportunities for large-scale data analysis in recent years. A huge amount of information is inaccessible to traditional web crawlers, since they rely on link analysis, which reaches only the surface web. Traditional search engine crawlers require web pages to be linked to other pages via hyperlinks, causing a large amount of web data to remain hidden from them. The deep web holds enormous amounts of data that can yield new insights across various domains, creating a need for efficient techniques to access this information. As the amount of web content grows rapidly, the types of data sources proliferate, often providing heterogeneous data; hence, deep web data sources must be selected carefully for use by integration systems. This paper discusses various techniques for surfacing deep web information and for deep web source selection.

Highlights

  • The tremendous increase in web content has created new opportunities for large-scale data analysis

  • In [7], an effective and scalable approach for selecting deep web sources based on data-source quality is proposed for deep web data integration systems

  • Hidden web content can be accessed by deep web crawlers that fill and submit forms to query online databases for information extraction

Summary

INTRODUCTION

The tremendous increase in web content has created new opportunities for large-scale data analysis. Most search engines access only the surface web, the portion of the web that can be discovered by following hyperlinks and downloading snapshots of pages for inclusion in the search engine's index [2]. The deep web, in contrast, consists of the following type of content. Dynamic data: data that can only be accessed through the query interface a site supports. These interfaces are based on one or more input attributes, and a user query involves specifying values for these attributes. The information in the deep web is estimated to be about 500 times larger than the surface web, with 7,500 terabytes of data across 200,000 deep web sites [6]. This wealth of information is missed because standard search engines cannot find the content generated by dynamic sites. There is a need to access this deep data by developing efficient techniques.
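To make the distinction concrete, the following minimal Python sketch contrasts the two access modes described above: a traditional crawler that discovers pages only by following hyperlinks, and a deep web query that supplies values for a form's input attributes. The endpoint URL and the form field names (origin, destination) are hypothetical placeholders for illustration, not taken from any source discussed in the paper.

import requests
from urllib.parse import urljoin
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags -- the link-analysis step
    a traditional crawler relies on."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_surface(seed_url, max_pages=10):
    """Traditional crawling: reachable pages are only those connected
    by hyperlinks, so form-gated (deep web) content is never found."""
    seen, frontier = set(), [seed_url]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        parser = LinkExtractor()
        parser.feed(resp.text)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen


def query_deep_source(form_url, attribute_values):
    """Deep web access: specify value(s) for the query interface's
    input attribute(s) and submit the form; the response page is
    generated dynamically from the backend database."""
    resp = requests.post(form_url, data=attribute_values, timeout=10)
    return resp.text


if __name__ == "__main__":
    # Hypothetical deep web source: a flight database reachable only
    # through its search form, not through any static hyperlink.
    page = query_deep_source(
        "https://example.com/flights/search",       # hypothetical endpoint
        {"origin": "DEL", "destination": "BOM"},    # hypothetical attributes
    )
    print(page[:200])

The point of the contrast is that crawl_surface can never reach the page returned by query_deep_source, because that page exists only as a response to a form submission; this is exactly the content that the surfacing techniques surveyed in the following sections aim to recover.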

TRADITIONAL WEB CRAWLER
ACCESSING THE DEEP WEB
DATA SOURCE SELECTION BASED ON QUALITY PARAMETER
DEEP WEB CRAWLER
WEB SOURCE VIRTUAL INTEGRATION
Data Mining on Deep Web
Clustering
Ontology Assisted Deep Web Search
Visual Approach
COMPARATIVE ANALYSIS
Send the web pages to virtual integration system
Limitations
Focused
URL de-duplication
CONCLUSION