Abstract

Along with the explosive growth of spam, anti-spam technologies, including rule-based approaches and machine learning, have developed rapidly. In the anti-spam industry, rule-based systems (RBS) have become the most prominent method for fighting spam because their rules can be enriched and updated remotely. However, filtering throughput remains a major challenge for RBS. In particular, the explosive spread of obfuscated words leads to frequent rule updates and extensive expansion of the rule vocabulary. These additional obfuscated words slow down filtering and reduce throughput. This paper addresses the throughput issue and proposes a rule-based spam detection algorithm with constant time complexity. The processing speed of the algorithm is independent of the size of the rules and their vocabulary. A new data structure, the Hash Forest, and a rule encoding method are developed to make constant time complexity possible. Instead of traversing every spam term in the rules, the proposed algorithm detects spam terms by checking a very small portion of all terms. The experimental results show the effectiveness of the proposed algorithm.

Highlights

  • Use of the Internet has grown explosively since it was first established in 1969

  • If the time complexity of the filtering algorithms of rule-based systems (RBS) can be reduced to constant, the throughput issue is solved, because expansion of the rules and their vocabulary will no longer slow down filtering

  • Experimental results: the experiment is based on production-environment data from the short message service (SMS) company that cooperated with us, as mentioned in the Introduction


Summary

INTRODUCTION

Use of the Internet has grown explosively since it was first established in 1969. The scale of data has increased overwhelmingly as well [1], especially with the wide use of social networks, personal communication tools, email and short messages (SMS). This ease of communication encouraged the emergence of large volumes of spam. If the time complexity of the filtering algorithms of RBS can be reduced to constant, the throughput issue is solved, because expansion of the rules and their vocabulary will no longer slow down filtering. The project was carried out to increase filtering speed to meet the overwhelming SMS sending throughput requirement. This project successfully addressed the throughput issue and reduced the time complexity of the spam detection algorithm to constant, O(1). The encoding method helps the filter evaluate the operators in rule expressions automatically.
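To illustrate why hash-based lookup can make detection cost independent of vocabulary size, the following is a minimal sketch only. It is not the Hash Forest described in the paper, whose details are not given in this summary. The class name `TermIndex`, its methods, and the whitespace tokenization are all assumptions made for illustration. Spam terms are grouped into one hash set per term length, so a message is checked with a number of hash lookups bounded by the message length and the longest term, not by the number of stored terms.

```python
from collections import defaultdict


class TermIndex:
    """Hypothetical index (not the paper's Hash Forest): one hash set per term length."""

    def __init__(self, terms):
        # Map term length (in words) to the set of terms of that length.
        self.by_len = defaultdict(set)
        for term in terms:
            words = tuple(term.lower().split())
            if words:
                self.by_len[len(words)].add(words)
        self.max_len = max(self.by_len, default=0)

    def find(self, message):
        """Return the spam terms found in the message, in order of appearance."""
        tokens = message.lower().split()
        hits = []
        for i in range(len(tokens)):
            # Try every window length up to the longest stored term.
            for n in range(1, self.max_len + 1):
                if i + n > len(tokens):
                    break
                window = tuple(tokens[i:i + n])
                # Average-case constant-time membership test; adding more
                # terms to the index does not add more lookups here.
                if window in self.by_len.get(n, ()):
                    hits.append(" ".join(window))
        return hits


index = TermIndex(["free prize", "v1agra", "win"])
print(index.find("Claim your FREE prize now and win"))
```

In this sketch, adding thousands of further obfuscated variants such as `v1agra` enlarges the hash sets but leaves the number of lookups per message unchanged, which is the property the summary attributes to a constant-time filter.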

RELATED WORK
AN ADDITIONAL FEATURE
CONCLUSION AND FUTURE WORK