Abstract

To obtain target webpages from a large set of webpages, we propose a Method for Filtering Pages by Similarity Degree based on Dynamic Programming (MFPSDDP). The method relies on one of three "same relationships" that we define between two nodes. The main innovation of MFPSDDP is that it does not need to know the structures of webpages in advance. First, we describe the design, which uses a queue and two threads. Then, we present a dynamic programming algorithm for calculating the length of the longest common subsequence and a formula for calculating similarity. Finally, to obtain the detailed-information webpages from 200,000 webpages downloaded from the website “www.jd.com”, we choose the Completely Same Relationship (CSR) and set the similarity threshold to 0.2. The Recall Ratio (RR) of MFPSDDP ranks in the middle of the four filtering methods compared. When the number of webpages filtered is nearly 200,000, the Precision Ratio (PR) of MFPSDDP is the highest of the four methods, reaching 85.1%, which is 13.3 percentage points higher than the PR of a Method for Filtering Pages by Containing Strings (MFPCS).
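The two computational pieces named in the abstract, the dynamic-programming calculation of the longest common subsequence (LCS) length and a similarity score compared against a threshold, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the similarity formula, so the normalization `2 * LCS / (len(a) + len(b))` is assumed, the names `lcs_length`, `similarity`, and `is_target` are hypothetical, plain equality of node labels stands in for the chosen same relationship, and treating a score equal to the threshold as a match is also an assumption.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via the standard DP recurrence."""
    # prev[j] is the LCS length of the already-processed prefix of a and b[:j].
    # Only the previous row of the DP table is needed at any time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:  # ASSUMPTION: label equality stands in for the same relationship
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]. ASSUMPTION: this normalization is illustrative only."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def is_target(page_seq: Sequence[str], sample_seq: Sequence[str],
              threshold: float = 0.2) -> bool:
    """Keep a page when its similarity to a sample target page reaches the threshold."""
    return similarity(page_seq, sample_seq) >= threshold
```

The default threshold of 0.2 mirrors the value reported in the abstract; with a different similarity formula the appropriate threshold would likely differ.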

Highlights

  • The purpose of filtering webpages is to obtain target webpages from among many webpages

  • Some studies traverse the XML tree by Depth First Search (DFS) or Breadth First Search (BFS) to obtain a node sequence, and convert the similarity calculation of two trees into calculating the length of the longest common subsequence of the two sequences [13,14,15]

  • Based on three same relationships defined between two nodes, we give the algorithm of MFPSDDP
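The tree-to-sequence step mentioned in the highlights can be illustrated with a short sketch. It assumes well-formed XML parsed with Python's standard `xml.etree.ElementTree` (real webpages would usually need an HTML parser first), uses only the tag name as the node label, and the helper names `dfs_tags` and `bfs_tags` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import deque


def dfs_tags(root: ET.Element) -> list:
    """Tag names in depth-first (pre-order) traversal order."""
    seq = []
    stack = [root]
    while stack:
        node = stack.pop()
        seq.append(node.tag)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(list(node)))
    return seq


def bfs_tags(root: ET.Element) -> list:
    """Tag names in breadth-first (level-by-level) traversal order."""
    seq = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        seq.append(node.tag)
        queue.extend(node)  # iterating an Element yields its direct children
    return seq


# A tiny made-up document, used only to show the two orderings.
doc = ET.fromstring(
    "<html><head><title/></head><body><div><p/></div><p/></body></html>"
)
print(dfs_tags(doc))  # ['html', 'head', 'title', 'body', 'div', 'p', 'p']
print(bfs_tags(doc))  # ['html', 'head', 'body', 'title', 'div', 'p', 'p']
```

Either sequence can then be passed to a longest-common-subsequence routine, which reduces the comparison of two trees to the comparison of two flat sequences.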

Summary

Introduction

The purpose of filtering webpages is to obtain target webpages from among many webpages. The webpages filtered out are non-target webpages, pornographic webpages, etc. Effective filtering methods are therefore needed. Many methods for filtering webpages have been proposed [1,2], and some of them are based on structure [1,2]. With existing structure-based filtering methods, programmers must know part of the structure of the webpages in advance. We propose a new structure-based filtering method, called a Method for Filtering Pages by Similarity Degree based on Dynamic Programming (MFPSDDP). Compared with other structure-based filtering methods, the main innovation of MFPSDDP is that it does not need to know the structures of webpages in advance. MFPSDDP has better accuracy and classifies webpages according to the similarity degree between the structures of two webpages. Programmers only need to choose, among the three same relationships, the one that leads to the highest filtering accuracy, without concern for the specific structure of the webpages.

Related Works
Filtering Methods Based on URI
Filtering Methods Based on Contents
Filtering Methods Based on Structure
Filtering Methods Based on Autonomous Learning
Algorithm of MFPSDDP
Same Relationship between Two Nodes
Double Thread Design
Experimental Analysis
Comparison with Other Methods
Filtering Method MFPCS
Filtering Method Main Configuration
Filtering Method MFPLR
Findings
Conclusions
