Sketching-Din Elimination of Web Page

Sivakumar Sivakumar

doi:10.3844/jcssp.2011.1888.1893

Abstract

Problem statement: Web content mining processes large numbers of web pages with the aim of extracting useful information or knowledge. Approach: Many types of web content can offer valuable information to users, for instance graphical data, Extensible Markup Language (XML) documents, Hyper Text Markup Language (HTML) documents and plain text. For a given purpose, however, only part of this information is useful; the remainder is noise. Results: In this study, we propose an approach for removing noise from a given web page in order to improve the performance of web content mining. First, the web page is divided into blocks. Conclusion: Duplicate blocks are then removed using sketching. The results indicate that the proposed approach is effective in classifying the main blocks.
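The first step described in the abstract, dividing a page into blocks, could be sketched as follows. This is only an illustrative sketch: the abstract does not state the segmentation rule, so the choice of block-delimiting tags (`BLOCK_TAGS`) and the names `BlockSplitter` and `split_into_blocks` are assumptions, not the author's method.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "blocks". The paper's own segmentation
# rule is not given in the abstract.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}


class BlockSplitter(HTMLParser):
    """Collects the visible text of a page as a list of text blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Normalise whitespace and store the buffered text as one block.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def split_into_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a block tag
    return parser.blocks
```

For example, a page with a navigation bar repeated at the top and bottom yields two identical blocks around the main text, which is the kind of duplication the sketching step is meant to catch.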

Highlights

The World Wide Web is rapidly emerging as an important medium for conducting commerce and for distributing information on a wide range of topics, for example industry, administration and games
Web mining is divided into three categories: Web Content Mining (WCM), Web Usage Mining (WUM) and Web Structural Mining (WSM)
To find duplicate blocks, we use the fingerprint method proposed by Charikar; this sketching algorithm is used to remove near-duplicate content from the web page
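As a rough illustration of the last highlight, a Charikar-style fingerprint (often called simhash) can be computed per block and compared by Hamming distance. The sketch below makes several assumptions that are not in the source: word tokens as features, unit weights, MD5 as the per-token hash, 64-bit fingerprints, and a distance threshold of 3. All function names are hypothetical.

```python
import hashlib
import re


def _token_hash(token, bits=64):
    # Assumption: MD5 truncated to `bits` bits serves as the per-token hash.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(text, bits=64):
    """Charikar-style fingerprint: per-bit vote over all token hashes."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = _token_hash(token, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def drop_near_duplicates(blocks, max_distance=3, bits=64):
    """Keep a block only if no earlier kept block is within max_distance."""
    kept, prints = [], []
    for block in blocks:
        fp = simhash(block, bits)
        if all(hamming(fp, seen) > max_distance for seen in prints):
            kept.append(block)
            prints.append(fp)
    return kept
```

Blocks with identical text always receive identical fingerprints, so exact repeats are dropped; blocks that differ in only a few tokens tend to differ in only a few fingerprint bits, which is what makes the threshold comparison a near-duplicate test. The threshold would need tuning on real pages.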

Summary

Introduction

The World Wide Web is rapidly emerging as an important medium for conducting commerce and for distributing information on a wide range of topics, for example industry, administration and games. Web content mining considers different kinds of data, such as images, audio, video and text (e.g., web documents and free text). Different kinds of web content can provide useful information to users, for example multimedia data, structured data (e.g., XML documents), semi-structured data (e.g., HTML documents) and unstructured data (e.g., plain text).

Journal: Journal of Computer Science
Publication Date: Dec 1, 2011
License type: CC BY
