Efficient suspicious URL filtering based on reputation

Chia-Mei Chen, Jhe-Jhun Huang, Ya-Hui Ou

doi:10.1016/j.jisa.2014.10.005

Abstract

An enormous number of web pages are visited each day over a network, and malicious websites might infect user machines. The most reliable approach to identifying malicious websites is the honeypot, an execution-based method. However, the vast amount of HTTP traffic makes sandboxing every web page impossible: it is not practical in real environments because it consumes too much computational resource and time. Only 1.3% of websites are malicious. A crawler-based approach may not find malicious websites effectively, as it has no clue whether they are visited by users. Therefore, the proposed system examines only user web requests, not crawler output, in order to catch real drive-by download web attacks. The challenge is that web traffic in real networks is huge, so efficient filtering is needed to process large-scale user requests. Based on our observation, the domains used in drive-by download attacks are often unreliable and exhibit attributes distinct from those of normal domains. To classify the massive volume of web traffic in a real network, this study proposes a two-stage drive-by download attack detection mechanism: it first identifies suspicious websites based on domain reputation and then sandboxes only the suspicious ones to reduce detection time. Such detection not only reduces the required computational resources and time but also retains the benefits of sandbox-based detection. Because the WHOIS database is not reliable, as not every domain query can be resolved, this study relies on queries from the DNS server and proposes novel reputation attributes to distinguish benign domains from suspicious ones. The experimental results show that the proposed filtering achieves an accuracy of 94% in a simulated real network environment and reduces computing time by a factor of more than 12 compared with an improved sandboxing approach.
This two-stage detection system, deployed in a real network environment with 560 thousand URL requests per day, demonstrates its practicality and efficiency under large-scale web requests. During deployment on the real network, it identified unknown malicious websites that were not listed on publicly accessible blacklist websites.
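
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `two_stage_detect`, the `reputation` and `sandbox` callables, the score range, and the threshold of 0.5 are all hypothetical, and the toy scores stand in for the DNS-derived reputation attributes, which the abstract does not enumerate. The sketch only shows how a cheap per-domain filter can limit how many URLs reach the expensive sandbox.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def two_stage_detect(
    urls: Iterable[str],
    reputation: Callable[[str], float],  # hypothetical: higher score = more trusted
    sandbox: Callable[[str], bool],      # hypothetical: True if the URL is malicious
    threshold: float = 0.5,              # assumed cut-off, not taken from the paper
) -> Tuple[List[str], int]:
    """Return (URLs flagged as malicious, number of sandbox runs)."""
    malicious: List[str] = []
    sandbox_runs = 0
    suspicious_by_domain: Dict[str, bool] = {}  # cache one verdict per domain

    for url in urls:
        domain = (urlsplit(url).hostname or "").lower()
        if domain not in suspicious_by_domain:
            suspicious_by_domain[domain] = reputation(domain) < threshold
        if not suspicious_by_domain[domain]:
            continue  # stage 1: reputable domain, skip the sandbox
        sandbox_runs += 1  # stage 2: execution-based check on suspicious URLs only
        if sandbox(url):
            malicious.append(url)
    return malicious, sandbox_runs


# Toy stand-ins for the reputation lookup and the sandbox (illustrative only).
toy_scores = {"example.com": 0.9, "bad.example.net": 0.1}


def toy_reputation(domain: str) -> float:
    return toy_scores.get(domain, 0.0)  # unknown domains are treated as suspicious


def toy_sandbox(url: str) -> bool:
    return "exploit" in url


requests_seen = [
    "http://example.com/index.html",
    "http://bad.example.net/exploit.html",
    "http://bad.example.net/about.html",
]
found, runs = two_stage_detect(requests_seen, toy_reputation, toy_sandbox)
print(found, runs)
```

In this toy run only the two requests to the low-reputation domain reach the sandbox, which is the source of the time saving the abstract reports; the real system would derive the reputation score from DNS query attributes rather than from a fixed table.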

Full Text