Abstract

In addition to the main content, web pages usually contain additional information in the form of noise, such as navigation elements, sidebars, and advertisements. This noise is unrelated to the main content, and it degrades data mining and information retrieval tasks, since sensors are corrupted by erroneous data and interference noise. Because of the diversity of web page structures, detecting relevant information and distinguishing it from noise, in order to improve the reliability of sensor networks, remains a challenge. In this paper, we propose a visual block construction method based on page type conversion (VB-PTC). The method combines hashtree-based site-level noise reduction with page-level noise reduction based on linked clusters to eliminate noise in web articles, and it converts multi-record complex pages into multi-record simple pages, effectively simplifying the rules for visual block construction. For multi-record content extraction, we apply different extraction methods according to the characteristics of each field, combining regular expressions, natural language processing, and symbol density detection, which greatly improves the accuracy of multi-record content extraction. VB-PTC can be used effectively for information retrieval, content extraction, and page rendering tasks.
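To make the site-level noise reduction idea concrete, the following is a minimal sketch, assuming a simplified DOM-like tree: subtrees are hashed recursively, and a subtree whose hash recurs on most pages of the same site is treated as template noise (navigation, footer, sidebar). The `Node` class, hashing scheme, and 0.8 threshold are illustrative assumptions, not the authors' implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only: a minimal DOM-like node. The real pipeline
# operates on parsed HTML; the subtree-hashing idea is what matters here.
@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def subtree_hash(node: Node) -> str:
    """Hash a subtree by its tag structure and text, recursively."""
    h = hashlib.sha1()
    h.update(node.tag.encode())
    h.update(node.text.strip().encode())
    for child in node.children:
        h.update(subtree_hash(child).encode())
    return h.hexdigest()

def site_level_noise_hashes(pages: List[Node], ratio: float = 0.8) -> set:
    """Subtrees whose hash recurs on at least `ratio` of a site's pages
    are treated as template noise (assumed threshold, for illustration)."""
    counts: Dict[str, int] = {}
    for page in pages:
        seen = set()
        stack = [page]
        while stack:                      # count each hash once per page
            n = stack.pop()
            seen.add(subtree_hash(n))
            stack.extend(n.children)
        for hsh in seen:
            counts[hsh] = counts.get(hsh, 0) + 1
    threshold = ratio * len(pages)
    return {hsh for hsh, c in counts.items() if c >= threshold}

# Toy usage: the same navigation block appears on both pages, the articles differ.
nav = Node("ul", "", [Node("li", "Home"), Node("li", "Sports")])
page_a = Node("body", "", [nav, Node("p", "Article A text ...")])
page_b = Node("body", "", [nav, Node("p", "Article B text ...")])
noise = site_level_noise_hashes([page_a, page_b])
print(subtree_hash(nav) in noise)                               # True  -> template noise
print(subtree_hash(Node("p", "Article A text ...")) in noise)   # False -> page content
```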

Highlights

  • With the rapid development of the World Wide Web (WWW), the internet has become the main source of information that can be accessed in the form of web articles [1]

  • Content extraction (CE) is a technique used to determine the correct part of an HTML document that contains the main content of a web document [4]

  • In this paper, a visual block construction method based on page type conversion (VB-PTC) is proposed



Introduction

With the rapid development of the World Wide Web (WWW), the internet has become the main source of information that can be accessed in the form of web articles [1]. To improve the reliability of data collection, more and more researchers rely on cloud processing and decision-making to find useful information from expanding knowledge sources [2], [3]. Content extraction (CE) is a technique used to determine the part of an HTML document that contains the main content of a web page [4]. Web pages contain the main content, but this information is surrounded by additional material such as anchor tags, advertisements, and various navigation elements. Only the main content matters, and any noise will have an adverse impact on the extraction of that content.
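As a rough illustration of how content blocks can be separated from noise, the sketch below uses a link-density and symbol-density heuristic in the spirit of the symbol density detection mentioned in the abstract. The density formulas and thresholds are assumptions chosen for the example, not the paper's tuned values.

```python
import re

def link_density(block_text: str, anchor_text: str) -> float:
    """Fraction of a block's characters that sit inside hyperlinks.
    Navigation and sidebar noise is typically almost all anchor text."""
    return len(anchor_text) / max(len(block_text), 1)

def symbol_density(block_text: str) -> float:
    """Punctuation symbols per word. Running prose contains commas and
    periods at a fairly stable rate; link lists contain almost none."""
    words = re.findall(r"\w+", block_text)
    symbols = re.findall(r"[,.;:!?]", block_text)
    return len(symbols) / max(len(words), 1)

def is_main_content(block_text: str, anchor_text: str) -> bool:
    # Illustrative thresholds, assumed for this sketch.
    return link_density(block_text, anchor_text) < 0.5 and symbol_density(block_text) > 0.02

# Toy usage: a navigation block vs. an article paragraph.
nav_text = "Home Sports World Business Tech"
body_text = ("Content extraction aims to locate the main article text, "
             "while discarding navigation, sidebars, and advertisements.")
print(is_main_content(nav_text, anchor_text=nav_text))  # False -> noise (all anchor text)
print(is_main_content(body_text, anchor_text=""))       # True  -> main content
```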


