Abstract

Web page classification is critical for information retrieval. Most web page classification methods have two faults: (1) they must analyze the entire web page, and (2) they pay too little attention to noise information inside the web page. Both faults reduce efficiency and classification performance, especially when classifying contaminated web pages. To solve these problems, this paper proposes a denoising algorithm. We choose the top-down method for hierarchical classification to improve prediction efficiency. The experimental results demonstrate that our method is about 7 times faster than the full-page method and achieves good classification results in most categories. The precision of all 7 parent categories is above 88%, which is on average 24% higher than the other meta tag-based method.
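The top-down hierarchical scheme mentioned above can be illustrated with a minimal sketch: a parent-level classifier chooses a parent category first, and only the child classifier belonging to that parent is then run. The keyword-scoring classifiers, the category names, and the function names below are all hypothetical stand-ins chosen for illustration; the paper itself trains deep learning models, and this sketch does not reproduce them.

```python
from typing import Callable, Dict, Set, Tuple

# A classifier maps text to a label. Keyword scoring is only a stand-in
# (an assumption for this sketch); the paper uses deep learning models.
Classifier = Callable[[str], str]


def keyword_classifier(keywords: Dict[str, Set[str]]) -> Classifier:
    """Build a toy classifier that picks the label with the most keyword hits."""
    def classify(text: str) -> str:
        tokens = text.lower().split()
        scores = {label: sum(token in words for token in tokens)
                  for label, words in keywords.items()}
        # On a tie, max() returns the first label in dictionary order.
        return max(scores, key=scores.get)
    return classify


def classify_top_down(text: str,
                      parent_clf: Classifier,
                      child_clfs: Dict[str, Classifier]) -> Tuple[str, str]:
    """Predict the parent category, then run only that parent's child classifier."""
    parent = parent_clf(text)
    child = child_clfs[parent](text)
    return parent, child


# Hypothetical two-level category tree.
parent_clf = keyword_classifier({
    "sports": {"football", "basketball", "match", "scores"},
    "finance": {"stocks", "bank", "loan", "market"},
})
child_clfs = {
    "sports": keyword_classifier({
        "football": {"football", "goal"},
        "basketball": {"basketball", "nba"},
    }),
    "finance": keyword_classifier({
        "stocks": {"stocks", "market"},
        "banking": {"bank", "loan"},
    }),
}

result = classify_top_down("live football scores and match news",
                           parent_clf, child_clfs)
print(result)
```

Because only one child classifier runs per page, prediction cost grows with the depth of the category tree rather than with the total number of leaf categories, which is the efficiency argument behind the top-down choice.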

Highlights

  • Web page topic classification is critical for website management and information retrieval

  • To solve the problem of classification efficiency, this paper proposes a web page classification method based on meta tag text such as the Title and the Description

  • To reduce the negative impact of injected data on data quality, a data cleaning method based on the ratio of sensitive words is designed to recognize and clean noisy text in the meta tags
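The two highlights on meta tag text and sensitive-word cleaning can be sketched together as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the whitespace tokenization, the sensitive-word list, the 0.3 threshold, and the rule of dropping a whole field are all hypothetical choices, and the names `MetaTextParser`, `sensitive_ratio`, and `clean_meta` are introduced here only for the example.

```python
from html.parser import HTMLParser
from typing import Dict, Set


class MetaTextParser(HTMLParser):
    """Collect the <title> text and the description meta tag content."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def sensitive_ratio(text: str, sensitive_words: Set[str]) -> float:
    """Fraction of tokens that appear in the sensitive-word list."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    if not tokens:
        return 0.0
    return sum(token in sensitive_words for token in tokens) / len(tokens)


def clean_meta(html: str, sensitive_words: Set[str],
               threshold: float = 0.3) -> Dict[str, str]:
    """Extract Title and Description; blank any field judged noisy.

    The threshold and the drop-the-field rule are hypothetical.
    """
    parser = MetaTextParser()
    parser.feed(html)
    fields = {"title": parser.title.strip(),
              "description": parser.description.strip()}
    return {name: ("" if sensitive_ratio(text, sensitive_words) > threshold
                   else text)
            for name, text in fields.items()}


page = ('<html><head><title>City Library Opening Hours</title>'
        '<meta name="description" '
        'content="casino bonus casino jackpot win now"></head></html>')
cleaned = clean_meta(page, {"casino", "jackpot", "bonus"})
print(cleaned)
```

In this example the injected Description is blanked while the Title survives, so a downstream classifier would see only the text that still reflects the page's theme.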


Summary

Introduction

Web page topic classification is critical for website management and information retrieval. Therefore, the web page meta tag text probably contains noise information unrelated to the web page's theme. The problems above make it more challenging to classify web pages. To solve those problems, this paper focuses on classifying web pages using meta tag text, including the Description and the Title, to save computing and storage resources. Classification experiments are carried out on the data set proposed in this paper, and the results show that our method is effective and less affected by data imbalance. (1) A method for topic classification that combines the meta tag text of the web page with a deep learning method is proposed, and its feasibility and effectiveness are demonstrated.

Related Works
Classification Framework
Text Noise Process
Experiment
Method
Conclusion
