Abstract

Data available on the internet plays a vital role today. Research suggests that much of the most valuable data resides in the deep web, so interest in techniques that efficiently locate this invisible web is growing. Extracting the deep web poses several challenges: the large volume of resources required, the dynamic nature of deep web content, the need to cover a wide area of the deep web, and the need to collect results both efficiently and accurately. Alongside these challenges, user privacy and identity must also be preserved. In this paper we propose a smart crawler that efficiently searches the deep web while avoiding irrelevant pages. The crawler starts from the center page of a seed URL and continues crawling until the last available link. It separates active from inactive links based on requests to each hyperlink's server. It also contains a text-based site classifier built with a neural network and natural language processing features, namely Term Frequency-Inverse Document Frequency and Bag of Words, combined with supervised machine learning techniques such as logistic regression, support vector machines, and naive Bayes. In addition, HTML tags are extracted from hyperlinks along with the page data, which aids data analysis, and all of this is stored separately in a centralized database. Our experimental results show an efficient link reaping rate and higher classification accuracy compared to other crawlers.
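As a rough illustration of two of the components described above, the following is a minimal sketch, not the authors' implementation: an active/inactive link check based on a request to the hyperlink's server, and a TF-IDF text classifier trained with naive Bayes, logistic regression, and a support vector machine. All URLs, function names, and training data here are illustrative assumptions.

```python
# Sketch only: assumes the `requests` and `scikit-learn` packages.
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def is_active(url: str, timeout: float = 5.0) -> bool:
    """Treat a hyperlink as active if its server answers with a 2xx/3xx status."""
    try:
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False  # timeouts, DNS failures, refused connections -> inactive


def train_site_classifiers(texts, labels):
    """Fit TF-IDF based classifiers; a Bag-of-Words CountVectorizer could be
    substituted for TfidfVectorizer in the same pipeline."""
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": LinearSVC(),
    }
    return {
        name: make_pipeline(TfidfVectorizer(stop_words="english"), model).fit(texts, labels)
        for name, model in models.items()
    }


if __name__ == "__main__":
    # Hypothetical toy data standing in for crawled page text and relevance labels.
    pages = ["searchable deep web database form", "photo gallery of cats"]
    relevant = [1, 0]
    classifiers = train_site_classifiers(pages, relevant)
    print(is_active("https://example.com"))
    print(classifiers["svm"].predict(["searchable database portal"]))
```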
