Study on the Effect of Preprocessing Methods for Spam Email Detection
The use of email as a communication technology is increasingly being exploited. As email use has grown, spam has become a serious nuisance to users, and its negative impact makes effective spam email detection techniques indispensable. A spam email detection algorithm, or spam classifier, works effectively only when supported by proper preprocessing steps (noise removal, stop-word removal, stemming, lemmatization, term-frequency weighting). This research studies the effect of preprocessing steps on the performance of supervised spam classifier algorithms. Experiments were conducted on two widely used supervised algorithms: Naïve Bayes and Support Vector Machine. The evaluation is performed on the Ling-Spam corpus using accuracy as the evaluation metric. The experimental results show that different preprocessing steps affect different classifiers differently.
- Research Article
5
- 10.18280/isi.270610
- Dec 31, 2022
- Ingénierie des systèmes d'information
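The preprocessing steps this study varies (noise removal, stop-word removal, stemming, term frequency) can be sketched in a few lines; the stop-word list and suffix rules below are toy stand-ins for real NLP resources, not the paper's actual pipeline:

```python
import re
from collections import Counter

# Toy stop-word list; real pipelines use NLTK/spaCy resources.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}

def preprocess(text):
    # Noise removal: keep only letters and spaces, lowercase everything.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    tokens = text.split()
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive suffix stripping as a stand-in for a real stemmer.
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

def term_frequencies(tokens):
    # Raw term counts; the classifier consumes these as features.
    return Counter(tokens)

tokens = preprocess("Win FREE prizes!!! Click the link and claim winnings now.")
tf = term_frequencies(tokens)
```

Each step can be toggled independently, which is exactly the kind of ablation the study performs per classifier.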
Spam is a major concern in present-day email, and spam is sent for several reasons; the two most common are advertising and fraud. Spam that combines text and image components is referred to as hybrid spam, and it is more dangerous and complex to detect than spam consisting of text or images alone. To distinguish spam from ham, an effective and intelligent approach is needed in order to obtain a strong representation of emails and improve classification performance. In this paper, we propose a multi-modal architecture relying on a feature model (MMA-FM) that concatenates two embedding vectors. The text and image sections of the same emails were processed separately, using a hybrid model (IMTF-IDF+Skip-thoughts) for text and a convolutional neural network (CNN) for image feature extraction. The extracted features are concatenated and given to Naïve Bayes (NB) and Support Vector Machine (SVM) models to classify each hybrid email as either spam or ham. We used hybrid datasets built from the Enron, Dredze, and TREC 2007 publicly accessible corpora. Our results show that the SVM model achieves an accuracy of 99.16%, higher than the Naïve Bayes method.
- Conference Article
3
- 10.1109/icnwc57852.2023.10127237
- Apr 5, 2023
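The fusion step of the MMA-FM architecture, concatenating a text embedding with a CNN image feature vector before classification, reduces to a simple operation; the vectors and dimensions below are illustrative stand-ins, not the paper's actual embeddings:

```python
# Hypothetical fixed-length feature vectors: in the paper these would come from
# IMTF-IDF+Skip-thoughts (text) and a CNN (image); here they are stand-ins.
text_features = [0.2, 0.7, 0.1]   # e.g. a 3-d text embedding
image_features = [0.9, 0.4]       # e.g. a 2-d CNN feature vector

def fuse(text_vec, image_vec):
    # Multi-modal fusion by concatenation; the joint vector
    # feeds the downstream NB/SVM classifier.
    return list(text_vec) + list(image_vec)

fused = fuse(text_features, image_features)
```

Concatenation preserves both modalities without forcing them into a shared space, which is why it is a common first choice for multi-modal fusion.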
In this project, we focus on electronic mail, one of the most important means of communication among information professionals. As its use among the general populace grows, so do its importance and utility. It has allowed for more adaptability and convenience in communication, in both the private and professional spheres. The increased use of email has led to a rise in spam as well as legitimate messages. An email that is sent to a large number of people without the recipients' knowledge or consent is considered spam. Millions of internet users, both casual and professional, are currently frustrated by the widespread problem of email spam. The purpose of this study is to provide a hybrid machine learning approach to identifying spam in email. The proposed hybrid techniques are bagging and boosting of machine learning-based Multinomial Naive Bayes, Decision Tree, KNN, Random Forest, and SVM methods. The bagging method uses a concurrent combination of weak classifiers to boost classification accuracy, and it decreases the variance of misclassifications. Alternatively, by linking the classifiers in series, the boosting strategy can construct a robust classifier out of two or more relatively weak classifiers; boosting improves classification results by reducing bias and variance. To detect spam in emails, it is necessary to assemble datasets, pre-process them, extract and select features, and classify the data. In this study, we evaluate the feasibility of conducting experiments on the Ling-Spam Corpus and the CSDMC2010 Spam Corpus. Based on the stop-word list and lemmatiser, the Ling-Spam Corpus is split into four different directories: bare, lemm, lemm_stop, and stop. Pre-processing consists of converting strings to word vectors (tokenization), stemming words, and removing stop words.
Since the Ling-Spam Corpus is already organised according to the stop-word list and the lemmatiser, only the CSDMC2010 Spam Corpus undergoes the stemming and stop-word removal processes. Features are then extracted and selected from the preprocessed data; the feature selection procedure in this work uses a correlation-based approach.
- Conference Article
10
- 10.1109/isemantic52711.2021.9573178
- Sep 18, 2021
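A minimal sketch of the bagging and boosting ensembles described above, using scikit-learn on synthetic features (the dataset, estimator counts, and default base learners are illustrative assumptions, not the paper's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for extracted spam-corpus features.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: a concurrent (parallel) combination of weak learners trained on
# bootstrap samples; averaging their votes reduces variance.
bagging = BaggingClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

# Boosting: a sequential chain where each learner focuses on the
# examples its predecessors misclassified, reducing bias.
boosting = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

bag_acc = bagging.score(X_te, y_te)
boost_acc = boosting.score(X_te, y_te)
```

The parallel-vs-sequential distinction in the abstract maps directly onto these two estimators.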
The Electronic Medical Record (EMR) is an important element of information technology in the healthcare sector. An EMR is an electronic record containing health-related information on patients that can be created and managed by authorized physicians and staff in a healthcare service organization, and it serves as a framework for determining diagnosis and treatment. EMRs have a free-text, unstructured format, which makes it difficult to extract the hidden information needed for a decision support system. This study performs classification of Indonesian EMRs for a clinical decision support system (CDSS), classifying patient diagnoses using Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction and a Support Vector Machine (SVM) as the classifier. SVM is a powerful algorithm for high-dimensional data such as textual data. The diagnoses classified in this paper are tuberculosis, cancer, diabetes mellitus, hypertension, and chronic kidney disease, which have high prevalence rates in Indonesia. The model is built considering the kernel function and the use or omission of stop-word removal. The results showed that the TF-IDF and SVM methods can effectively predict diagnoses when stop-word removal is applied: classification performance increased with stop-word removal on all SVM kernels, with accuracies of 89.91% for the linear kernel, 90.58% for the polynomial kernel, 90.75% for the RBF kernel, and 91.03% for the sigmoid kernel.
- Research Article
36
- 10.5815/ijmecs.2016.07.08
- Jul 8, 2016
- International Journal of Modern Education and Computer Science
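The TF-IDF weighting used for feature extraction in the study above can be computed by hand; this sketch uses the classic idf = log(N/df) variant (libraries often add smoothing terms) on toy token lists standing in for clinical notes:

```python
import math
from collections import Counter

# Toy tokenized documents standing in for preprocessed EMR notes.
docs = [
    ["patient", "cough", "fever"],
    ["patient", "fever", "diabetes"],
    ["patient", "hypertension"],
]

N = len(docs)
# Document frequency: in how many notes each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Classic idf = log(N / df); a term in every document gets weight 0.
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tfidf(docs[0])
```

Note how "patient", present in every note, is weighted to zero; rarer, more discriminative terms like "cough" dominate, which is what makes TF-IDF features useful to the SVM.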
The increasing worldwide use of e-mail, owing to its simplicity and low cost, has led many Internet users to conduct their work over the Internet. At the same time, many natural and legal persons send unsolicited bulk e-mails. Hence, the classification and identification of spam emails is very important. In this paper, a combination of the Particle Swarm Optimization algorithm and an Artificial Neural Network is used for feature selection, and a Support Vector Machine is used to classify and separate spam. Finally, we compared the proposed method with other classification methods, such as Self-Organizing Map and K-Means, based on the Area Under Curve criterion. The results indicate that the Area Under Curve of the proposed method is better than that of the other methods.
- Research Article
8
- 10.1504/ijcistudies.2018.10016073
- Jan 1, 2018
- International Journal of Computational Intelligence Studies
Automatic classification of poetic content is very challenging from the computational-linguistic point of view. For a library suggestion framework, poetries can be grouped along different dimensions, for example poet, time period, sentiment, and topic. In this work, a content-based Punjabi poetry classifier was built using the Weka toolset. Four classes were manually populated with 2,034 poetries: the NAFE, LIPA, RORE, and PHSP classes comprise 505, 399, 529, and 601 poems, respectively. These poems were passed through several pre-processing sub-stages, for example tokenisation, noise removal, stop-word removal, and special-symbol removal. A total of 31,938 tokens was extracted after the pre-processing layer and weighted using the term frequency (TF) and term frequency-inverse document frequency (TF-IDF) weighting schemes. Depending on the poetic elements of the poetry, two different poetic feature sets (orthographic and phonemic) were used to build classifiers with machine learning algorithms. Naive Bayes, Support Vector Machine, Hyper Pipes, and K-nearest neighbour algorithms were evaluated with the two poetic feature sets. The results revealed that the addition of poetic features does not boost the performance of the Punjabi poetry classification task. Using poetic features, the best-performing algorithm is SVM, and the highest accuracy (71.98%) is achieved with orthographic features.
- Research Article
28
- 10.34028/iajit/17/1/5
- Jan 1, 2019
- The International Arab Journal of Information Technology
Analysis of poetic text is very challenging from a computational-linguistic perspective; computational analysis of literary arts, especially poetry, is a very difficult classification task. For a library recommendation system, poetries can be classified on various metrics such as poet, time period, sentiment, and subject matter. In this work, a content-based Punjabi poetry classifier was developed using the Weka toolset. Four categories were manually populated with 2,034 poems: the Nature and Festival (NAFE), Linguistic and Patriotic (LIPA), Relation and Romantic (RORE), and Philosophy and Spiritual (PHSP) categories consist of 505, 399, 529, and 601 poems, respectively. These poems were passed through various pre-processing sub-phases such as tokenization, noise removal, stop-word removal, and special-symbol removal. The 31,938 extracted tokens were weighted using the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting schemes. Based on poetry elements, three different textual feature sets (lexical, syntactic, and semantic) were used to develop classifiers with different machine learning algorithms. Naive Bayes (NB), Support Vector Machine, Hyper Pipes, and K-nearest neighbour algorithms were evaluated with these textual features. The results revealed that the semantic features performed better than the lexical and syntactic ones. The best-performing algorithm is SVM, and the highest accuracy (76.02%) is achieved by incorporating the semantic information associated with words.
- Research Article
8
- 10.32520/stmsi.v13i4.2701
- Jul 29, 2024
- SISTEMASI
In Indonesia, the government, through the Indonesian National Police (POLRI), has recently released a new regulation, Electronic Traffic Law Enforcement (ETLE): a traffic ticketing policy carried out electronically through camera monitoring connected directly to the vehicle registration certificate (STNK) database. The government can measure people's approval or disapproval of such public policies through sentiment analysis. Previous studies have applied sentiment analysis to gauge people's responses to ETLE; however, that model reached an accuracy of only 0.42. This study proposes the use of a support vector machine (SVM), term frequency-inverse document frequency (TF-IDF), and mean decrease in impurity (MDI) to evaluate polarized sentiment analysis of ETLE policies. First, we retrieve tweets about ETLE from Twitter. Then we perform text pre-processing and stop-word removal, followed by the TF-IDF process. We compare two feature selection methods, MDI and recursive feature elimination (RFE), and two classification models, naïve Bayes and SVM. The metrics we use to evaluate the pre-processing stage are the probability density function (PDF) and the t-test; we use a bag of words (BoW) to evaluate the stop-word removal stage; and sensitivity, specificity, and the receiver operating characteristic (ROC) curve are used to evaluate the feature selection and classification methods. The test results show that TF-IDF produces 1,022 new features. The combination of methods yielded the six models we compared, of which SVM+TF-IDF+MDI performed best, with accuracy and area under the curve (AUC) scores of 0.99 and 0.97, respectively.
- Research Article
22
- 10.1155/2023/6648970
- Sep 20, 2023
- Applied Computational Intelligence and Soft Computing
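Mean decrease in impurity (MDI) feature selection, as used in the study above, amounts to ranking features by a random forest's impurity-based importances; the sketch below uses scikit-learn on synthetic data (the matrix and the choice of k are illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the TF-IDF feature matrix.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# MDI: each feature's importance is the impurity decrease it causes at its
# splits, averaged over all trees (sklearn's feature_importances_).
k = 10
top_k = np.argsort(forest.feature_importances_)[::-1][:k]
```

Keeping only the top-k columns of X by this ranking is the selection step; the reduced matrix then feeds the downstream classifier.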
The use of the short message service (SMS) and e-mail has increased enormously over the last decades. 80% of people do not read e-mails, while 98% of cell phone users read their SMS daily. However, these communication media are unsafe and can carry malicious attacks called spam. E-mails that pretend to be from a trusted company in order to obtain "financial or personal information" are phishing e-mails. These e-mails contain links, and users might download malicious software onto their computers when they click on them. Many techniques and models have been developed to automatically detect such SMS and e-mail spam, but none of them has achieved 100% accuracy. In previous machine learning (ML) studies, spam detection on small datasets has resulted in lower accuracy. To counter this problem, in this paper, multiple ML classifiers and a deep learning (DL) classifier were applied to an SMS and e-mail dataset for spam detection with higher accuracy. After conducting experiments on the real dataset, the researchers concluded that the proposed system performed better and more accurately than previously existing models. Specifically, the support vector machine (SVM) classifier outperformed all others, suggesting that SVM is the optimal choice for this classification task.
- Conference Article
1
- 10.1109/icnlp52887.2021.00007
- Mar 1, 2021
Spam email detection is a research hotspot, and the most efficient detection methods are based on deep learning. In the context of the extensive use of pre-trained word vectors in deep neural networks, this paper studies the impact of pre-trained word vector models on a Text-CNN-based spam classification model, and uses token-granularity matching to optimize the word2vec pre-trained word vector model's representation of spam emails. By comparing the accuracy and time complexity of spam classification with and without token-granularity matching, it can be concluded that word2vec pre-trained word vectors combined with token-granularity processing improve the performance of the Text-CNN model on spam email classification.
- Research Article
- 10.35134/komtekinfo.v12i4.659
- Dec 30, 2025
- Jurnal KomtekInfo
Advances in information technology and the increasing use of social media have significantly influenced the behavior of Generation Z. The generation born between 1997 and 2012 is known to be very familiar with the digital world, but it also faces challenges such as a lack of in-person social interaction and the risk of mental health disorders. This study aims to identify and classify public sentiment towards Generation Z on social media, especially on platform X (formerly Twitter), using the Support Vector Machine (SVM) method. The research was carried out in several stages: collection of 1,607 text samples using crawling techniques; text pre-processing (tokenization, case folding, stop-word removal, stemming, and normalization); and feature extraction using the Term Frequency-Inverse Document Frequency (TF-IDF) method. The processed data were then classified by the SVM into three sentiment categories: positive, negative, and neutral. Evaluation was carried out by measuring accuracy, recall, and F1-score through a confusion matrix. The results showed an accuracy of 85%, a precision of 85%, a recall of 95%, and an F1-score of 90%, indicating that SVM was able to classify sentiment with high accuracy and stability. In addition, SVM proved more effective than the other methods examined in previous studies. The analyzed data show that most sentiment towards Generation Z is negative, reflecting public concern about the behavior and mindset of this generation. This research is expected to serve as a reference for academics, practitioners, and policymakers in understanding public opinion and designing targeted policies for the younger generation. Keywords: Sentiment Analysis, Generation Z, Support Vector Machine, Social Media, Machine Learning.
- Research Article
- 10.29040/ijcis.v6i3.253
- Aug 10, 2025
- International Journal of Computer and Information System (IJCIS)
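The metrics reported in studies like the one above follow directly from confusion-matrix counts; the counts below are hypothetical, chosen only to roughly reproduce the figures quoted in the abstract:

```python
# Hypothetical binary confusion-matrix counts (positive = "negative sentiment"):
# tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
tp, fp, fn, tn = 90, 16, 5, 50

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # fraction of all correct predictions
precision = tp / (tp + fp)                    # how many flagged positives were real
recall    = tp / (tp + fn)                    # how many real positives were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
```

With these counts, accuracy ≈ 0.87, precision ≈ 0.85, recall ≈ 0.95, and F1 ≈ 0.90, illustrating how a high recall and moderate precision combine into the reported F1.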
The rapid spread of disinformation and fabricated news across online platforms poses a critical risk to informed public engagement and the foundations of democratic governance. This study examines how well different machine learning techniques can classify fake news, using textual features extracted through the Term Frequency-Inverse Document Frequency (TF-IDF) method. The analysis covers five commonly used algorithms: Logistic Regression, Support Vector Machine (SVM), Naive Bayes, Random Forest, and XGBoost. A publicly accessible dataset containing annotated real and fake news articles served as the basis for training and testing these models. The dataset underwent extensive preprocessing, including tokenization, stopword removal, and TF-IDF vectorization, resulting in a sparse high-dimensional matrix of 5,068 documents and 39,978 features. Performance evaluation was based on multiple metrics: train/test accuracy, misclassification rate, false positives/negatives, mean cross-validation score, and execution time. Results showed that SVM and Logistic Regression achieved the highest test accuracy (93.61% and 92.27%, respectively) and exhibited robust cross-validation scores, indicating strong generalization ability. In contrast, Naive Bayes produced faster results but suffered from a high false positive rate and lower accuracy (84.77%). Random Forest and XGBoost demonstrated good predictive power but showed signs of overfitting and moderate misclassification rates. These findings suggest that SVM and Logistic Regression are well suited for fake news detection in textual datasets using TF-IDF features. While traditional models remain effective, future work may explore deep learning approaches and context-aware language models to enhance detection accuracy across more complex and multilingual datasets. This study contributes to ongoing efforts to combat misinformation through automated, scalable, and interpretable machine learning techniques.
- Research Article
- 10.1142/s0219467823500341
- Jul 28, 2022
- International Journal of Image and Graphics
Twitter spam has become a significant problem in recent years. Existing work focuses on exploiting machine learning models to detect spam on Twitter by determining statistical features of the tweets. Even though these models achieve good results, it is hard to sustain the performance attained by supervised approaches. This paper introduces a deep learning-assisted spam classification model for Twitter, based on the sentiments and topics modeled in the tweets. The initial step is data collection. Subsequently, the collected data are preprocessed with stop-word removal, stemming, and tokenization. The next step is feature extraction, wherein POS tagging, headwords, rule-based lexicon, word length, and weighted holoentropy features are extracted. Then, the proposed sentiment score extraction is carried out to analyze their variation between non-spam and spam information. Finally, the diffusions of spam data on Twitter are classified into spam and non-spam. For this, an optimized deep ensemble technique is introduced that encloses a neural network (NN), support vector machine (SVM), random forest (RF), and deep neural network (DNN). In particular, the weights of the DNN are optimally tuned by an arithmetic crossover-based cat swarm optimization (AC-CS) model. Finally, the supremacy of the developed approach is examined via evaluation against extant techniques. Accordingly, the proposed AC-CS ensemble model attained a better accuracy value at a learning percentage of 80, which is 18.1%, 14.89%, 11.7%, 12.77%, 10.64%, 6.38%, 6.38%, and 6.38% higher than the SVM, DNN, RNN, DBN, MFO ensemble, WOA ensemble, EHO ensemble, and CSO ensemble models.
- Research Article
- 10.3390/computers15010007
- Dec 23, 2025
- Computers
The expansion of electronic health records (EHRs) has generated a large amount of unstructured textual data, such as clinical notes and medical reports, which contain diagnostic and prognostic information. Effective classification of these textual medical notes is critical for improving clinical decision support and healthcare data management. This study presents a statistically rigorous comparative analysis of four traditional machine learning algorithms—Random Forest, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine—for multiclass classification of medical notes into four disease categories: Neoplasms, Digestive System Diseases, Nervous System Diseases, and Cardiovascular Diseases. A dataset containing 9633 labeled medical notes was preprocessed through text cleaning, lemmatization, stop-word removal, and vectorization using term frequency-inverse document frequency (TF–IDF) representation. The models were trained and optimized through GridSearchCV with 5-fold cross-validation and evaluated across five independent stratified 90-10 train–test splits. Evaluation metrics, including accuracy, precision, recall, F1-score, and multiclass ROC-AUC, were used to assess model performance. Logistic Regression demonstrated the strongest overall performance, achieving an average accuracy of 0.8469 and high macro and weighted F1 scores, followed by Support Vector Machine and Multinomial Naive Bayes. Misclassification patterns revealed substantial lexical overlap between digestive and neurological disease notes, underscoring the limitations of TF–IDF representations in capturing deeper semantic distinctions. These findings confirm that traditional machine learning models remain robust, interpretable, and computationally efficient tools for textual medical note classification, and the study establishes a transparent and reproducible benchmark that provides a solid foundation for future methodological advancements in clinical natural language processing.
- Research Article
- 10.21107/kursor.v13i1.417
- Jul 18, 2025
- Jurnal Ilmiah Kursor
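The training setup described above (a TF-IDF pipeline tuned with GridSearchCV under 5-fold cross-validation) can be sketched with scikit-learn; the toy corpus, the Logistic Regression stand-in, and the parameter grid are illustrative assumptions, not the study's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy two-class corpus standing in for labeled medical notes.
texts = [
    "tumor mass biopsy oncology", "chemotherapy tumor lesion",
    "nausea ulcer gastric pain", "liver gastric reflux nausea",
    "tumor oncology radiation", "ulcer digestion gastric",
    "biopsy lesion oncology mass", "reflux pain digestion liver",
    "radiation chemotherapy mass", "gastric ulcer pain reflux",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]

# TF-IDF vectorization and the classifier tuned together as one pipeline,
# so the vectorizer is refit inside every cross-validation fold.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(texts, labels)
```

Putting the vectorizer inside the pipeline prevents information from the held-out fold leaking into the fitted vocabulary, which is the main reason to tune this way.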
Sentiment analysis plays a crucial role in natural language processing by identifying and categorizing opinions or emotions conveyed in textual data. It is widely applied across diverse fields such as product review analysis, social media monitoring, and market research. To enhance the accuracy and reliability of sentiment classification, various methods and feature extraction techniques have been explored. This study investigates the use of the Support Vector Machine (SVM) for sentiment analysis, comparing three feature extraction techniques: Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and Word2Vec. Our findings indicate that SVM performs effectively with all three feature extraction methods, with TF-IDF yielding the highest accuracy at 0.79. Although the BoW method showed competitive results, it slightly trailed TF-IDF in k-fold validation. Word2Vec, however, exhibited the lowest performance, achieving a maximum accuracy of 0.69. A comparative analysis of accuracy, precision, recall, and F1-score highlights the superiority of TF-IDF in delivering consistent and accurate results. Further statistical analysis using ANOVA revealed no significant differences between the models across any of the evaluation metrics. Additionally, the evaluation was conducted under several scenarios, including tests on balanced and imbalanced datasets, varying dataset sizes, and different values of the SVM C parameter. These scenarios provided deeper insights into the factors influencing the system's performance, reinforcing that TF-IDF combined with SVM remains the most effective approach in this study.
- Research Article
- 10.37034/infeb.v7i3.1266
- Sep 30, 2025
- Jurnal Informatika Ekonomi Bisnis
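The BoW and TF-IDF extractors compared in the study above differ only in how term occurrences are weighted; a scikit-learn sketch on an illustrative corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative review snippets standing in for the studies' corpora.
reviews = [
    "great product works great",
    "bad product stopped working",
    "works fine great value",
]

# BoW: raw term counts per document.
bow = CountVectorizer()
X_bow = bow.fit_transform(reviews)

# TF-IDF: the same counts, rescaled so terms common across the
# whole corpus contribute less to each document vector.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(reviews)
```

Both produce sparse matrices over the same vocabulary; only the cell values differ, which is why the two often perform similarly with an SVM, as the study observed.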
The digitalization of public services has encouraged the development of the Jamsostek Mobile (JMO) application by BPJS Ketenagakerjaan. This application is expected to provide convenience in accessing information, JHT claims, and other services. However, user reviews on the Google Play Store show diverse perceptions, ranging from satisfaction to technical complaints. This study aims to conduct sentiment analysis on user reviews of the JMO application by classifying opinions into positive, negative, and neutral sentiments. Data were collected through crawling from the Google Play Store and processed using text preprocessing stages, including data cleaning, case folding, stopword removal, tokenization, stemming, and Term Frequency–Inverse Document Frequency (TF-IDF) weighting. The classification process was then carried out using three machine learning algorithms, namely Support Vector Machine (SVM), Random Forest, and Logistic Regression. The results indicate that negative sentiment dominates with 46%, followed by positive sentiment at 40% and neutral at 14%. Most complaints are related to login difficulties, application errors, and technical bugs in claim features. In terms of algorithm performance, SVM with a linear kernel achieved the highest accuracy of 87.5% and an F1-score of 0.87, outperforming Random Forest (85.3%) and Logistic Regression (82.7%). Academically, this study reinforces the effectiveness of SVM in sentiment analysis using TF-IDF, while practically providing recommendations for BPJS Ketenagakerjaan to improve system stability, login speed, and reduce application bugs to enhance user satisfaction.