Abstract

Problem statement: Web content mining processes large numbers of web pages with the aim of extracting useful information or knowledge. Approach: Several types of web content can offer valuable information to users, for instance graphical data, Extensible Markup Language (XML) documents, HyperText Markup Language (HTML) documents and plain text. For a given purpose, only part of this information is useful and the rest is noise. Results: In this study, we propose an approach for removing noise from a given web page, which improves the performance of web content mining. First, the web page is divided into blocks. Conclusion: Duplicate blocks are then removed using sketching. The results demonstrate the effectiveness of the proposed approach in classifying the main blocks.

Highlights

  • The World Wide Web is rapidly emerging as an important medium for conducting commerce and for distributing information on a wide range of topics, for example industry, administration and games

  • Web mining is divided into three categories: Web Content Mining (WCM), Web Usage Mining (WUM) and Web Structure Mining (WSM)

  • To find duplicate blocks, we use the fingerprint method proposed by Charikar; this sketching algorithm is used to remove near-duplicate content in the web page
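
The near-duplicate step described above can be illustrated with a minimal sketch of a Charikar-style fingerprint (often called simhash). This is not the authors' implementation: the word tokenization, the use of MD5 as the per-token hash, the 64-bit fingerprint size, the `max_distance` threshold, and the function names `simhash`, `hamming` and `drop_near_duplicates` are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Charikar-style fingerprint of a text block (illustrative, unweighted tokens)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: MD5 truncated to 64 bits serves as the per-token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(blocks, max_distance=3):
    """Keep the first block of each group whose fingerprints lie within max_distance bits."""
    kept, fingerprints = [], []
    for block in blocks:
        fp = simhash(block)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(block)
            fingerprints.append(fp)
    return kept


if __name__ == "__main__":
    # Hypothetical page blocks: the navigation bar appears twice.
    page_blocks = [
        "Home | About | Contact",
        "Web content mining extracts useful information from web pages.",
        "Home | About | Contact",
    ]
    for block in drop_near_duplicates(page_blocks):
        print(block)
```

Blocks whose fingerprints differ in only a few bits are treated as near-duplicates, so a repeated navigation bar or footer is kept once. The threshold would need tuning on real pages.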

Summary

Introduction

The World Wide Web is rapidly emerging as an important medium for conducting commerce and for distributing information on a wide range of topics, for example industry, administration and games. Web content mining considers different kinds of data, such as images, audio, video and text (e.g., web documents and free text). Different kinds of web content can provide useful information to users, for example multimedia data, structured data (e.g., XML documents), semi-structured data (e.g., HTML documents) and unstructured data (e.g., plain text).
