A Vision-Based Approach for Deep Web Form Extraction

Jiachen Pu, Jin Liu, Jin Wang

doi:10.1007/978-981-10-5041-1_111

Abstract

The World Wide Web is a vast source of information whose data reside in either the Surface Web or the Deep Web. Compared with the Surface Web, the Deep Web contains a greater amount of structured, higher-quality data, but that data is difficult to use directly. Existing methods for Deep Web form extraction fall into several categories, including HTML-based, vision-based, ontology-based, ML-based, and NLP-based approaches. This paper combines the DOM tree with a convolutional neural network to locate forms in a Web page. It proposes VBF, a vision-based method that identifies the forms in a Web page by acquiring the page's HTML code and screenshots, building the DOM tree, applying the neural network, and then performing form recognition, matching, and generation.

Full Text