Efficiency Improvement Approach of Deep Web Data Extraction

Mona Nasr, Mohamed Thabet, Hanan Fahmy

doi:10.1109/icces48960.2019.9068134

Abstract

The Deep Web is an important research topic. Because deep web pages have a complicated structure, extracting their content is very challenging. This paper proposes a framework for efficiently discovering deep web data records. The framework crawls and fetches the pages relevant to a user's text query. To retrieve the relevant pages, the paper proposes a similarity method based on an improved weighting function (ITF-IDF). The framework uses the visual features of a web page to obtain data records rather than analyzing the HTML source code. To retrieve the data records accurately, it uses an approach called a layout tree. The framework applies a Noise Filter (NSFilter) algorithm to eliminate noise such as headers, footers, ads, and other unnecessary content. Data records are defined as visual blocks with a similar layout. To cluster visual blocks with similar layouts, the paper proposes a method based on appearance similarity and on the similar shape and coordinate (SSC) feature. The experimental results show that the proposed framework outperforms previous data extraction approaches.
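
The abstract does not define the improved weighting function (ITF-IDF), so the following is only a minimal sketch of the general idea of ranking fetched pages against a text query. It assumes plain TF-IDF with smoothed IDF and cosine similarity as a stand-in; it is not the paper's ITF-IDF. The names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_pages` are hypothetical, and including the query in the corpus when computing document frequencies is a simplifying assumption.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain TF-IDF.

    Assumption: standard TF-IDF with smoothed IDF, not the paper's ITF-IDF.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query, pages):
    """Return (page_index, score) pairs, most similar page first."""
    # Assumption: the query is treated as one more document in the corpus.
    vectors = tfidf_vectors(list(pages) + [query])
    query_vec = vectors[-1]
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Calling `rank_pages("laptop reviews", pages)` on a list of page texts returns the page indices ordered by similarity to the query; pages that share no terms with the query score 0.0.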

Full Text