Two-step email spam detection: comparing machine and deep learning accuracy

Abstract

Artificial intelligence (AI) continues to be a transformative field, contributing to data science by supporting optimal decision-making. One notable application of AI is in digital forensics, particularly spam email classification. This paper presents a two-step approach to distinguishing regular emails from spam. In the first step, emails are evaluated for vulnerabilities against three criteria: varying time intervals between Mail Transfer Agents (MTAs), the presence of binary attachments, and inconsistencies in the IP addresses associated with the same user. In the second step, Machine Learning (ML) and Deep Learning (DL) algorithms are compared to identify the most effective method for accurate classification. The findings show that the Support Vector Machine (SVM), an ML algorithm, outperforms the Recurrent Neural Network (RNN), a DL algorithm, with an accuracy of 96% versus 90%. A notable conclusion is that manual pre-processing yields more accurate results and better interpretability than automatic pre-processing. This highlights the importance of human intervention at certain stages of AI-driven processes, even when advanced algorithms are used. The results suggest that combining strategic criteria evaluation with careful algorithm selection is essential for improving the precision of spam classification in digital forensics.
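
The first step described in the abstract can be pictured as a small rule-based screen. The Python sketch below is only an illustration under stated assumptions and is not the authors' implementation: the `Email` fields, the five-minute gap threshold, the rule that any non-`text/` MIME type counts as binary, and the `known_ips` lookup are all hypothetical choices made to keep the example concrete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Email:
    # Hypothetical schema; the abstract does not define one.
    sender: str
    sender_ip: str
    hop_times: list[datetime]  # one timestamp per MTA hop, in order
    attachment_types: list[str] = field(default_factory=list)  # MIME types


# Assumed threshold, chosen for illustration only.
MAX_HOP_GAP = timedelta(minutes=5)


def irregular_hop_timing(email: Email) -> bool:
    """Criterion 1 (assumed reading): some gap between MTA hops is too long."""
    gaps = [b - a for a, b in zip(email.hop_times, email.hop_times[1:])]
    return any(gap > MAX_HOP_GAP for gap in gaps)


def has_binary_attachment(email: Email) -> bool:
    """Criterion 2 (assumed reading): any attachment that is not text/*."""
    return any(not t.startswith("text/") for t in email.attachment_types)


def inconsistent_ip(email: Email, known_ips: dict[str, set[str]]) -> bool:
    """Criterion 3 (assumed reading): sender IP not among those seen before."""
    seen = known_ips.get(email.sender, set())
    return bool(seen) and email.sender_ip not in seen


def screen(email: Email, known_ips: dict[str, set[str]]) -> dict[str, bool]:
    """Return one flag per criterion; these could feed a later classifier."""
    return {
        "irregular_hop_timing": irregular_hop_timing(email),
        "binary_attachment": has_binary_attachment(email),
        "inconsistent_ip": inconsistent_ip(email, known_ips),
    }
```

In a pipeline of this shape, the three flags would be computed for each message before the second step, where a classifier such as an SVM or RNN is trained and evaluated; how the paper actually combines the two steps is not specified in the abstract.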

More from: Electrotehnica, Electronica, Automatica
  • Research Article
  • 10.46904/eea.25.73.2.1108011
Two-step email spam detection: comparing machine and deep learning accuracy
  • May 30, 2025
  • Electrotehnica, Electronica, Automatica
  • Dhai Eddine Salhi + 3 more

