Abstract

In addition to the main content, web pages usually contain additional information in the form of noise, such as navigation elements, sidebars, and advertisements. This noise is unrelated to the main content, and it degrades data mining and information retrieval tasks, since sensors are corrupted by erroneous data and interference noise. Because of the diversity of web page structures, detecting relevant information and distinguishing it from noise, in order to improve the reliability of sensor networks, remains a challenge. In this paper, we propose a visual block construction method based on page type conversion (VB-PTC). The method combines hashtree-based site-level noise reduction with page-level noise reduction based on linked clusters to eliminate noise in web articles, and it converts multi-record complex pages into multi-record simple pages, effectively simplifying the rules for visual block construction. For multi-record content extraction, we apply different extraction methods according to the characteristics of each field, combining regular expressions, natural language processing, and symbol density detection, which greatly improves the accuracy of multi-record content extraction. VB-PTC can be used effectively for information retrieval, content extraction, and page rendering tasks.
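To make the site-level noise reduction idea concrete, the following is a minimal sketch, assuming a simplified DOM-like tree: subtrees are hashed recursively, and a subtree whose hash recurs on most pages of the same site is treated as template noise (navigation, footer, sidebar). The `Node` class, hashing scheme, and 0.8 threshold are illustrative assumptions, not the authors' implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only: a minimal DOM-like node. The real pipeline
# operates on parsed HTML; the subtree-hashing idea is what matters here.
@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def subtree_hash(node: Node) -> str:
    """Hash a subtree by its tag structure and text, recursively."""
    h = hashlib.sha1()
    h.update(node.tag.encode())
    h.update(node.text.strip().encode())
    for child in node.children:
        h.update(subtree_hash(child).encode())
    return h.hexdigest()

def site_level_noise_hashes(pages: List[Node], ratio: float = 0.8) -> set:
    """Subtrees whose hash recurs on at least `ratio` of a site's pages
    are treated as template noise (assumed threshold, for illustration)."""
    counts: Dict[str, int] = {}
    for page in pages:
        seen = set()
        stack = [page]
        while stack:                      # count each hash once per page
            n = stack.pop()
            seen.add(subtree_hash(n))
            stack.extend(n.children)
        for hsh in seen:
            counts[hsh] = counts.get(hsh, 0) + 1
    threshold = ratio * len(pages)
    return {hsh for hsh, c in counts.items() if c >= threshold}

# Toy usage: the same navigation block appears on both pages, the articles differ.
nav = Node("ul", "", [Node("li", "Home"), Node("li", "Sports")])
page_a = Node("body", "", [nav, Node("p", "Article A text ...")])
page_b = Node("body", "", [nav, Node("p", "Article B text ...")])
noise = site_level_noise_hashes([page_a, page_b])
print(subtree_hash(nav) in noise)                               # True  -> template noise
print(subtree_hash(Node("p", "Article A text ...")) in noise)   # False -> page content
```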

Highlights

  • With the rapid development of the World Wide Web (WWW), the internet has become the main source of information that can be accessed in the form of web articles [1]

  • Content extraction (CE) is a technique used to determine the correct part of an HTML document that contains the main content of a web document [4]

  • In this paper, a visual block construction method based on page type conversion (VB-PTC) is proposed



Introduction

With the rapid development of the World Wide Web (WWW), the internet has become the main source of information that can be accessed in the form of web articles [1]. To improve the reliability of data collection, more and more researchers rely on cloud processing and decision-making to find useful information from expanding knowledge sources [2], [3]. Content extraction (CE) is a technique used to determine the part of an HTML document that contains the main content of a web page [4]. Web pages contain the main content, but this information is surrounded by additional material such as anchor tags, advertisements, and various navigation elements. Only the main content matters, and any noise will have an adverse impact on the extraction of that content.
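As a rough illustration of how content blocks can be separated from noise, the sketch below uses a link-density and symbol-density heuristic in the spirit of the symbol density detection mentioned in the abstract. The density formulas and thresholds are assumptions chosen for the example, not the paper's tuned values.

```python
import re

def link_density(block_text: str, anchor_text: str) -> float:
    """Fraction of a block's characters that sit inside hyperlinks.
    Navigation and sidebar noise is typically almost all anchor text."""
    return len(anchor_text) / max(len(block_text), 1)

def symbol_density(block_text: str) -> float:
    """Punctuation symbols per word. Running prose contains commas and
    periods at a fairly stable rate; link lists contain almost none."""
    words = re.findall(r"\w+", block_text)
    symbols = re.findall(r"[,.;:!?]", block_text)
    return len(symbols) / max(len(words), 1)

def is_main_content(block_text: str, anchor_text: str) -> bool:
    # Illustrative thresholds, assumed for this sketch.
    return link_density(block_text, anchor_text) < 0.5 and symbol_density(block_text) > 0.02

# Toy usage: a navigation block vs. an article paragraph.
nav_text = "Home Sports World Business Tech"
body_text = ("Content extraction aims to locate the main article text, "
             "while discarding navigation, sidebars, and advertisements.")
print(is_main_content(nav_text, anchor_text=nav_text))  # False -> noise (all anchor text)
print(is_main_content(body_text, anchor_text=""))       # True  -> main content
```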


