An Empirical Study on the Application of Data Augmentation Techniques to Enhance the Performance of CNN Models in Spam Detection

Martin J Gruber, Matthew T Schneider

doi:10.56397/ist.2024.05.06

Abstract

Spam detection is a critical task in cybersecurity that aims to filter out unsolicited and potentially harmful communications. This study investigates how data augmentation affects the performance of Convolutional Neural Network (CNN) models for spam detection. Using the Enron Email Dataset, we implemented several augmentation methods, including synonym replacement, random insertion, random swap, random deletion, back translation, and noise addition. Our results indicate that these techniques improve performance over the baseline. The baseline CNN model achieved an accuracy of 87.5%, precision of 85.2%, recall of 83.7%, and F1-score of 84.4%. Back translation, the most effective technique, raised accuracy to 90.3% and F1-score to 88.0%, gains of 2.8 and 3.6 percentage points respectively. These findings demonstrate the potential of data augmentation for improving spam detection systems and provide a foundation for future research. The study also highlights the value of combining augmentation techniques and adapting them to other languages and real-world scenarios, which may yield further performance gains.
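As an illustration only, two of the token-level augmentations named above (random swap and random deletion) can be sketched in a few lines of Python, together with the standard F1 formula used to relate the reported precision and recall. This is a minimal sketch, not the implementation used in the study: the function names, the whitespace tokenization, the parameters `n_swaps` and `p`, and the sample message are all assumptions made for the example.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps randomly chosen position pairs exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Hypothetical message, tokenized on whitespace (an assumption of this sketch).
    email = "claim your free prize now by clicking this link".split()
    print(" ".join(random_swap(email, 2, rng)))
    print(" ".join(random_deletion(email, 0.2, rng)))
    # Baseline precision and recall from the abstract; rounds to 84.4.
    print(round(f1_score(85.2, 83.7), 1))
```

The F1 helper also serves as a consistency check: the baseline precision of 85.2% and recall of 83.7% give an F1-score of about 84.4%, matching the reported value.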

Full Text