Over the past few years, the Internet of Things (IoT), one of the most popular technologies of the modern era, has grown exponentially. In IoT, various objects perform sensing, communication, and computation to provide uninterrupted services (e.g., e-health, e-transportation, and security access) to end users. The Cognitive Internet of Things (CIoT) is a related paradigm developed to enhance the intelligence of IoT objects so that they can make independent decisions in any environment. IoT follows a service-oriented architecture (SOA), in which the application layer is the topmost layer. It enables IoT objects to interact with other objects located across the globe. The ability of these objects to learn, think, and understand can make information access more accurate and reliable, but Web spam remains one of the challenges in accessing information from the Web. The literature shows that people mostly prefer search engines for accessing information. Efficient ranking by search engines can reduce the computational cost of information exchange by IoT objects. Search engines should therefore be able to prevent spam from being injected into the Web. However, existing techniques aim to find spam only after it has appeared in search engine result pages. In this proposal, we present an intelligent cognitive spammer framework, Cognitive spammer, which eliminates spam pages during the web page rank score calculation by search engines. The framework updates Google’s ranking algorithm, PageRank, so that it automatically prevents link spam by considering the link structure of the Web during rank score computation. The updated PageRank algorithm provides a better ranking of web pages. The proposed framework is validated with the WEBSPAM-UK2007 dataset.
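For readers unfamiliar with the baseline that the framework modifies, the following is a minimal sketch of standard PageRank computed by power iteration over a link graph. It is not the updated algorithm proposed here, whose details are not given in this abstract; the function name `pagerank`, the damping factor of 0.85, and the toy graph are assumptions made for illustration only.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Standard power-iteration PageRank over {page: [pages it links to]}.

    Illustrative baseline only; the spam-aware update described in the
    abstract is NOT implemented here.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its rank along every out-link.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page web: "c" receives links from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

Because the score of a page depends only on who links to it, a group of pages that link heavily to one another can inflate their own scores, which is the link-spam weakness that motivates changing the rank computation itself.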
Before processing, the dataset is preprocessed with a new technique, called ‘Split by Over-sampling and Train by Under-fitting’, to handle the imbalance among instances of the target class. After data cleaning, we apply machine learning techniques (bagged model, boosted linear model, etc.) to the web page features to make accurate predictions. The detection classifiers consider only the link features of a web page, irrespective of its content. Of the fifteen classifiers, the best three are combined into an ensemble, which improves overall accuracy. Ten-fold cross-validation of the resulting ensemble model yields an accuracy of 99.6% for the proposed scheme.
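The evaluation idea described above, combining three classifiers and scoring the ensemble with ten-fold cross-validation, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: simple majority voting stands in for the unspecified combination rule, the hypothetical `threshold_trainer` stands in for the actual classifiers, and the toy link features are invented, so the resulting number says nothing about the reported accuracy.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def majority_vote(predictions):
    """Combine per-classifier label lists (0 = normal, 1 = spam) by majority."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


def threshold_trainer(feature_index):
    """Hypothetical weak learner: cut halfway between the two class means.

    Assumes spam pages have the larger values of the chosen feature.
    """
    def fit(x_train, y_train):
        spam = [x[feature_index] for x, y in zip(x_train, y_train) if y == 1]
        normal = [x[feature_index] for x, y in zip(x_train, y_train) if y == 0]
        cut = (sum(spam) / len(spam) + sum(normal) / len(normal)) / 2
        return lambda x: 1 if x[feature_index] > cut else 0
    return fit


def cross_validated_accuracy(features, labels, trainers, k=10):
    """Train every classifier on each training fold, vote on the held-out fold."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        x_train = [features[i] for i in train]
        y_train = [labels[i] for i in train]
        models = [fit(x_train, y_train) for fit in trainers]
        votes = majority_vote([[m(features[i]) for i in test] for m in models])
        correct += sum(v == labels[i] for v, i in zip(votes, test))
    return correct / len(labels)


# Invented toy data: three link-based features per page, label 1 = spam.
labels = [i % 2 for i in range(20)]
features = [[10 * y + i % 3, 8 * y + i % 4, 5 * y + i % 2]
            for i, y in enumerate(labels)]
trainers = [threshold_trainer(j) for j in range(3)]
accuracy = cross_validated_accuracy(features, labels, trainers, k=10)
```

Each fold is held out exactly once, so every page contributes one prediction to the final accuracy and no classifier is scored on data it was trained on.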