Abstract

Spammers are constantly developing new spam techniques, the latest of which is image spam. Until now, research on spam image identification has considered properties such as colour, size, compressibility, entropy and content. However, we feel that the resulting identification methods have certain limitations because of embedded obfuscation such as complex backgrounds, compression artifacts and a wide variety of fonts and formats. To overcome these limitations, we propose two methodologies (although more are possible). Each methodology has four stages. The two methodologies are nearly identical except in the second stage, where methodology I extracts low-level features while the second methodology extracts metadata features. A comparison of the two methodologies is also presented. The methods handle images with and without noise separately. Colour properties of the images are altered so that OCR (Optical Character Recognition) can easily read the text embedded in the image. The proposed methods are tested on a dataset of 1984 spam images and are found to be effective in identifying all types of spam images, whether they contain (1) only text, (2) only images or (3) both text and images. The experimental results are encouraging: methodology I achieves an accuracy of 92%, while the second methodology achieves an accuracy of 93.3%.

Highlights

  • Image spam is a kind of e-mail spam in which the message text is presented as an image file

  • Low-level and metadata features of images are extracted, as they are robust against randomly added noise and simple translational shifts of the images

  • Our experiments show that 58.65% of spam images and 44.28% of normal images are smaller than 10 KB
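
As a rough illustration of how a file-size statistic like the one above could be computed, the following sketch counts the share of images below a size threshold. The function names, the example sizes and the choice of 1 KB = 1024 bytes are assumptions made for illustration; they are not taken from the paper.

```python
# Assumption: "10KB" is interpreted as 10 * 1024 bytes.
SIZE_THRESHOLD_BYTES = 10 * 1024


def is_small(size_bytes, threshold=SIZE_THRESHOLD_BYTES):
    """Return True when a file is strictly smaller than the threshold."""
    return size_bytes < threshold


def percent_small(sizes, threshold=SIZE_THRESHOLD_BYTES):
    """Percentage of file sizes (in bytes) that fall below the threshold."""
    if not sizes:
        return 0.0
    small = sum(1 for size in sizes if is_small(size, threshold))
    return 100.0 * small / len(sizes)


# Hypothetical file sizes in bytes, invented for this example only.
example_sizes = [4_200, 9_800, 15_360, 31_000]
share = percent_small(example_sizes)
```

Because a large share of both spam and normal images falls below the threshold, file size on its own would be a weak discriminator; it is more plausibly useful as one metadata feature among several.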


Summary

Introduction

Image spam is a kind of e-mail spam in which the message text is presented as an image file. Spam filters employ OCR to read text embedded in images. OCR works by measuring the geometry in images, searching for shapes that match the shapes of letters, and translating each matched geometric shape into real text. To defeat OCR, spammers distort the geometry of letters enough (by altering colours, for example) that OCR can't "see" a letter, even though the human eye recognizes it. To counter this evasion, low-level and metadata features of images are extracted, as they are robust against randomly added noise and simple translational shifts of the images. We review the prior significant work in the area of image spam identification.
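
To illustrate why low-level features can tolerate a simple translational shift, here is a minimal sketch using a grey-level histogram and its Shannon entropy. These two features, the bin count, the toy image and all function names are assumptions chosen for illustration; the source does not specify which low-level features are extracted. A circular shift is used so that no pixels are lost at the image border.

```python
import math
from collections import Counter


def grey_histogram(pixels, bins=8):
    """Normalised histogram of 8-bit grey levels in a 2-D list of pixels."""
    counts = Counter(value * bins // 256 for row in pixels for value in row)
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(bins)]


def shannon_entropy(hist):
    """Entropy in bits of a normalised histogram (empty bins are skipped)."""
    return -sum(p * math.log2(p) for p in hist if p > 0)


def shift_right(pixels, k):
    """Circularly shift every row k pixels to the right."""
    return [row[-k:] + row[:-k] for row in pixels]


# Toy 3x4 greyscale image, invented for this example only.
image = [
    [0, 0, 255, 255],
    [0, 128, 128, 255],
    [64, 64, 192, 192],
]
shifted = shift_right(image, 1)

features = grey_histogram(image)
features_shifted = grey_histogram(shifted)

# The shift moves pixels around but does not change how often each grey
# level occurs, so the histogram and its entropy stay the same.
same_histogram = features == features_shifted
same_entropy = shannon_entropy(features) == shannon_entropy(features_shifted)
```

The histogram discards pixel positions, which is what makes it insensitive to the shift. An OCR step, by contrast, depends on the geometry of letter shapes, so this sketch only covers the shift case and says nothing about robustness to added noise.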

