Abstract

In recent years, the World Wide Web (WWW) has become a global data center that lets people store and distribute information. The information in Web pages may be personal, official, commercial, or business related. Users want to access such information for their own needs, so using Web data for a specific purpose requires techniques that classify Web pages and deliver the suitable data in them to users. This paper proposes a new technique for Web page classification that uses level-based classification and a hierarchical indexing model over predefined domains: Sports, Politics, and Education. The method works in two phases, training and testing. During the training phase, dynamic feature extraction and knowledge representation are performed. During the testing phase, the features extracted from Web pages are used for content matching to classify them. The technique comprises three steps: dynamic feature extraction, knowledge representation, and classification of randomly distributed Web pages. During feature extraction, important keywords are extracted from the headers and paragraphs of Web pages. The frequency of occurrence of each keyword is determined, and the frequency values are multiplied by weights to separate the keywords into different priority levels. The represented knowledge is then used for content matching to classify Web pages. The percentage of belongingness of a Web page to each category is calculated using a Maximum Entropy classifier, chosen for its advantage in search-based optimization. The method is evaluated on three categories of Web pages: Sports, Politics, and Education. The technique achieves a classification accuracy of 91%, which is higher than that of conventional classification techniques.
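The feature extraction and scoring steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the weights (HEADER_WEIGHT, PARAGRAPH_WEIGHT), the keyword lists in DOMAIN_KEYWORDS, and the function names are all hypothetical, and the per-category percentage uses a softmax over summed keyword weights, which has the functional form of a maximum entropy model but with fixed rather than trained parameters.

```python
import math
import re
from collections import Counter

# Hypothetical priority weights: a keyword found in a header counts more
# than the same keyword found in a paragraph.
HEADER_WEIGHT = 3.0
PARAGRAPH_WEIGHT = 1.0

# Hypothetical hand-written vocabularies standing in for the knowledge
# representation that the paper builds during its training phase.
DOMAIN_KEYWORDS = {
    "Sports": {"match", "team", "score", "league", "player"},
    "Politics": {"election", "minister", "parliament", "vote", "party"},
    "Education": {"university", "student", "exam", "course", "school"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weighted_frequencies(headers, paragraphs):
    """Count each token, multiplying occurrences by the weight of its source."""
    freqs = Counter()
    for text in headers:
        for token in tokenize(text):
            freqs[token] += HEADER_WEIGHT
    for text in paragraphs:
        for token in tokenize(text):
            freqs[token] += PARAGRAPH_WEIGHT
    return freqs


def belongingness(freqs):
    """Return a percentage per category from a softmax over keyword scores."""
    scores = {
        domain: sum(freqs[keyword] for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {domain: math.exp(score - top) for domain, score in scores.items()}
    total = sum(exps.values())
    return {domain: 100.0 * value / total for domain, value in exps.items()}


if __name__ == "__main__":
    page_headers = ["League match report"]
    page_paragraphs = ["The team won the match and the player scored."]
    features = weighted_frequencies(page_headers, page_paragraphs)
    percentages = belongingness(features)
    print(max(percentages, key=percentages.get), percentages)
```

In a trained maximum entropy classifier, the per-keyword contributions would be learned from labeled pages instead of being fixed at one per matching keyword as they are here.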
