Abstract

A web page is a semi-structured document that carries a large amount of attribute information in addition to its text. Traditional web page classification techniques are mostly based on text classification methods and ignore this additional attribute information. We propose WEB-GNN, an approach to web page classification. This work makes two major contributions. First, we propose a web page graph representation method called W2G, which reconstructs text nodes into a graph based on the visual association relationships between texts and the DOM-tree hierarchy, and thereby integrates web page content and structure efficiently. Second, we propose a web page classification method based on a graph convolutional neural network. It takes the web page graph representation as input, integrates text features and structural features through graph convolution, and generates a high-level web page feature representation. Experimental results on the Web-black dataset suggest that the proposed method significantly outperforms text-only methods.
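
As a rough illustration of the pipeline described above, the following minimal sketch builds a small graph over text nodes and applies one graph convolution step before pooling the nodes into a page-level vector. It is not the authors' implementation: the abstract does not specify how W2G derives its edges or features, so the edge list, the 2-dimensional text embeddings, the identity weight matrix, and all function names are assumptions made for illustration. The propagation rule used is the commonly cited symmetric normalization with self-loops, ReLU(D^-1/2 (A + I) D^-1/2 X W), which may differ from the layer used in WEB-GNN.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list-of-lists matrix."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each node keeps its own features
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0  # assumption: edges are treated as undirected
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(x, y):
    """Plain dense matrix product for list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_layer(a_hat, features, weights):
    """One graph convolution step: ReLU(A_hat @ X @ W)."""
    h = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in h]


def mean_pool(node_features):
    """Average node vectors into a single page-level vector."""
    n = len(node_features)
    return [sum(col) / n for col in zip(*node_features)]


# Toy page with three text nodes: 0 = title, 1 = paragraph, 2 = link text.
# Hypothetical edges: (0, 1) stands in for a DOM-hierarchy relation,
# (1, 2) for a visual-association relation.
edges = [(0, 1), (1, 2)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical text embeddings
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen only for readability

a_hat = normalized_adjacency(3, edges)
page_vector = mean_pool(gcn_layer(a_hat, features, weights))
```

In a full classifier, a vector such as `page_vector` would typically be passed to a trainable output layer, and the weights would be learned rather than fixed.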
