Study on the Effect of Preprocessing Methods for Spam Email Detection
The use of email as a communication technology is increasingly being exploited. As email use has grown, spam has become a serious nuisance to users, and its negative impact makes effective spam email detection techniques indispensable. A spam email detection algorithm, or spam classifier, works effectively only when supported by proper preprocessing steps (noise removal, stop-word removal, stemming, lemmatization, term-frequency weighting). This research studies the effect of preprocessing steps on the performance of supervised spam classifier algorithms. Experiments were conducted on two widely used supervised algorithms: Naïve Bayes and Support Vector Machine. The evaluation is performed on the Ling-Spam corpus using accuracy as the evaluation metric. The experimental results show that different preprocessing steps affect different classifiers differently.
- Research Article
5
- 10.18280/isi.270610
- Dec 31, 2022
- Ingénierie des systèmes d'information
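The preprocessing steps this study varies (noise removal, stop-word removal, stemming, term frequency) can be sketched in a few lines; the stop-word list and suffix rules below are toy stand-ins for real NLP resources, not the paper's actual pipeline:

```python
import re
from collections import Counter

# Toy stop-word list; real pipelines use NLTK/spaCy resources.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}

def preprocess(text):
    # Noise removal: keep only letters and spaces, lowercase everything.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    tokens = text.split()
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive suffix stripping as a stand-in for a real stemmer.
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

def term_frequencies(tokens):
    # Raw term counts; the classifier consumes these as features.
    return Counter(tokens)

tokens = preprocess("Win FREE prizes!!! Click the link and claim winnings now.")
tf = term_frequencies(tokens)
```

Each step can be toggled independently, which is exactly the kind of ablation the study performs per classifier.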
Spam is a major concern in present-day email, and spam is sent for several reasons; the two most common are advertising and fraud. Spam that combines text and image components is referred to as hybrid spam, and it is more dangerous and complex to detect than spam consisting of text or images alone. To distinguish spam from ham, an effective and intelligent approach is needed in order to obtain a strong representation of emails and improve classification performance. In this paper, we propose a multi-modal architecture relying on a feature model (MMA-FM) that concatenates two embedding vectors. The text and image sections of the same emails were processed separately, using a hybrid model (IMTF-IDF+Skip-thoughts) for text and a convolutional neural network (CNN) for image feature extraction. The extracted features are concatenated and given to Naïve Bayes (NB) and Support Vector Machine (SVM) models to classify each hybrid email as either spam or ham. We used hybrid datasets built from the Enron, Dredze, and TREC 2007 publicly accessible corpora. Our results show that the SVM model achieves an accuracy of 99.16%, higher than the Naïve Bayes method.
- Conference Article
3
- 10.1109/icnwc57852.2023.10127237
- Apr 5, 2023
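The fusion step of the MMA-FM architecture, concatenating a text embedding with a CNN image feature vector before classification, reduces to a simple operation; the vectors and dimensions below are illustrative stand-ins, not the paper's actual embeddings:

```python
# Hypothetical fixed-length feature vectors: in the paper these would come from
# IMTF-IDF+Skip-thoughts (text) and a CNN (image); here they are stand-ins.
text_features = [0.2, 0.7, 0.1]   # e.g. a 3-d text embedding
image_features = [0.9, 0.4]       # e.g. a 2-d CNN feature vector

def fuse(text_vec, image_vec):
    # Multi-modal fusion by concatenation; the joint vector
    # feeds the downstream NB/SVM classifier.
    return list(text_vec) + list(image_vec)

fused = fuse(text_features, image_features)
```

Concatenation preserves both modalities without forcing them into a shared space, which is why it is a common first choice for multi-modal fusion.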
In this project, we focus on electronic mail, one of the most important means of communication among information professionals. As its use among the general populace grows, so do its importance and utility. It has allowed for more adaptability and convenience in communication, in both the private and professional spheres. The increased use of email has led to a rise in spam as well as legitimate messages. An email that is sent to a large number of people without the recipients' knowledge or consent is considered spam. Millions of internet users, both casual and professional, are currently frustrated by the widespread problem of email spam. The purpose of this study is to provide a hybrid machine learning approach to identifying spam in email. The proposed hybrid techniques are bagging and boosting of machine learning-based Multinomial Naive Bayes, Decision Tree, KNN, Random Forest, and SVM methods. The bagging method uses a concurrent combination of weak classifiers to boost classification accuracy, and it decreases the variance of misclassifications. Alternatively, by linking the classifiers in series, the boosting strategy can construct a robust classifier out of two or more relatively weak classifiers; boosting improves classification results by reducing bias and variance. To detect spam in emails, it is necessary to assemble datasets, pre-process them, extract and select features, and classify the data. In this study, we evaluate the feasibility of conducting experiments on the Ling-Spam Corpus and the CSDMC2010 Spam Corpus. Based on the stop-word list and lemmatiser, the Ling-Spam Corpus is split into four different directories: bare, lemm, lemm_stop, and stop. Pre-processing consists of converting strings to word vectors (tokenization), stemming words, and removing stop words.
Since the Ling-Spam Corpus is already organised according to the stop-word list and the lemmatiser, only the CSDMC2010 Spam Corpus undergoes the stemming and stop-word removal processes. Features are then extracted and selected from the preprocessed data; the feature selection procedure in this work uses a correlation-based approach.
- Conference Article
10
- 10.1109/isemantic52711.2021.9573178
- Sep 18, 2021
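A minimal sketch of the bagging and boosting ensembles described above, using scikit-learn on synthetic features (the dataset, estimator counts, and default base learners are illustrative assumptions, not the paper's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for extracted spam-corpus features.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: a concurrent (parallel) combination of weak learners trained on
# bootstrap samples; averaging their votes reduces variance.
bagging = BaggingClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

# Boosting: a sequential chain where each learner focuses on the
# examples its predecessors misclassified, reducing bias.
boosting = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

bag_acc = bagging.score(X_te, y_te)
boost_acc = boosting.score(X_te, y_te)
```

The parallel-vs-sequential distinction in the abstract maps directly onto these two estimators.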
The Electronic Medical Record (EMR) is an important element of information technology in the healthcare sector. An EMR is an electronic record containing health-related information on patients that can be created and managed by authorized physicians and staff in a healthcare service organization, and it serves as a framework for determining diagnosis and treatment. EMRs have a free-text, unstructured format, which makes it difficult to extract the hidden information needed for a decision support system. This study performs classification of Indonesian EMRs for a clinical decision support system (CDSS), classifying patient diagnoses using Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction and a Support Vector Machine (SVM) as the classifier. SVM is a powerful algorithm for high-dimensional data such as textual data. The diagnoses classified in this paper are tuberculosis, cancer, diabetes mellitus, hypertension, and chronic kidney disease, which have high prevalence rates in Indonesia. The model is built considering the kernel function and the use or omission of stop-word removal. The results showed that the TF-IDF and SVM methods can effectively predict diagnoses when stop-word removal is applied: classification performance increased with stop-word removal on all SVM kernels, with accuracies of 89.91% for the linear kernel, 90.58% for the polynomial kernel, 90.75% for the RBF kernel, and 91.03% for the sigmoid kernel.
- Research Article
36
- 10.5815/ijmecs.2016.07.08
- Jul 8, 2016
- International Journal of Modern Education and Computer Science
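The TF-IDF weighting used for feature extraction in the study above can be computed by hand; this sketch uses the classic idf = log(N/df) variant (libraries often add smoothing terms) on toy token lists standing in for clinical notes:

```python
import math
from collections import Counter

# Toy tokenized documents standing in for preprocessed EMR notes.
docs = [
    ["patient", "cough", "fever"],
    ["patient", "fever", "diabetes"],
    ["patient", "hypertension"],
]

N = len(docs)
# Document frequency: in how many notes each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Classic idf = log(N / df); a term in every document gets weight 0.
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tfidf(docs[0])
```

Note how "patient", present in every note, is weighted to zero; rarer, more discriminative terms like "cough" dominate, which is what makes TF-IDF features useful to the SVM.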
The increasing worldwide use of e-mail, owing to its simplicity and low cost, has led many Internet users to conduct their work over the Internet. At the same time, many natural and legal persons send unsolicited bulk e-mails. Hence, the classification and identification of spam emails is very important. In this paper, a combination of the Particle Swarm Optimization algorithm and an Artificial Neural Network is used for feature selection, and a Support Vector Machine is used to classify and separate spam. Finally, we compared the proposed method with other classification methods, such as Self-Organizing Map and K-Means, based on the Area Under Curve criterion. The results indicate that the Area Under Curve of the proposed method is better than that of the other methods.
- Research Article
8
- 10.1504/ijcistudies.2018.10016073
- Jan 1, 2018
- International Journal of Computational Intelligence Studies
Automatic classification of poetic content is very challenging from the computational-linguistic point of view. For a library suggestion framework, poetries can be grouped along different dimensions, for example poet, time period, sentiment, and topic. In this work, a content-based Punjabi poetry classifier was built using the Weka toolset. Four classes were manually populated with 2,034 poetries: the NAFE, LIPA, RORE, and PHSP classes comprise 505, 399, 529, and 601 poems, respectively. These poems were passed through several pre-processing sub-stages, for example tokenisation, noise removal, stop-word removal, and special-symbol removal. A total of 31,938 tokens was extracted after the pre-processing layer and weighted using the term frequency (TF) and term frequency-inverse document frequency (TF-IDF) weighting schemes. Depending on the poetic elements of the poetry, two different poetic feature sets (orthographic and phonemic) were used to build classifiers with machine learning algorithms. Naive Bayes, Support Vector Machine, Hyper Pipes, and K-nearest neighbour algorithms were evaluated with the two poetic feature sets. The results revealed that the addition of poetic features does not boost the performance of the Punjabi poetry classification task. Using poetic features, the best-performing algorithm is SVM, and the highest accuracy (71.98%) is achieved with orthographic features.
- Research Article
28
- 10.34028/iajit/17/1/5
- Jan 1, 2019
- The International Arab Journal of Information Technology
Analysis of poetic text is very challenging from a computational-linguistic perspective; computational analysis of literary arts, especially poetry, is a very difficult classification task. For a library recommendation system, poetries can be classified on various metrics such as poet, time period, sentiment, and subject matter. In this work, a content-based Punjabi poetry classifier was developed using the Weka toolset. Four categories were manually populated with 2,034 poems: the Nature and Festival (NAFE), Linguistic and Patriotic (LIPA), Relation and Romantic (RORE), and Philosophy and Spiritual (PHSP) categories consist of 505, 399, 529, and 601 poems, respectively. These poems were passed through various pre-processing sub-phases such as tokenization, noise removal, stop-word removal, and special-symbol removal. The 31,938 extracted tokens were weighted using the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting schemes. Based on poetry elements, three different textual feature sets (lexical, syntactic, and semantic) were used to develop classifiers with different machine learning algorithms. Naive Bayes (NB), Support Vector Machine, Hyper Pipes, and K-nearest neighbour algorithms were evaluated with these textual features. The results revealed that the semantic features performed better than the lexical and syntactic ones. The best-performing algorithm is SVM, and the highest accuracy (76.02%) is achieved by incorporating the semantic information associated with words.
- Research Article
8
- 10.32520/stmsi.v13i4.2701
- Jul 29, 2024
- SISTEMASI
In Indonesia, the government, through the Indonesian National Police (POLRI), has recently released a new regulation, Electronic Traffic Law Enforcement (ETLE): a traffic ticketing policy carried out electronically through camera monitoring connected directly to the vehicle registration certificate (STNK) database. The government can measure people's approval or disapproval of such public policies through sentiment analysis. Previous studies have applied sentiment analysis to gauge people's responses to ETLE; however, that model reached an accuracy of only 0.42. This study proposes the use of a support vector machine (SVM), term frequency-inverse document frequency (TF-IDF), and mean decrease in impurity (MDI) to evaluate polarized sentiment analysis of ETLE policies. First, we retrieve tweets about ETLE from Twitter. Then we perform text pre-processing and stop-word removal, followed by the TF-IDF process. We compare two feature selection methods, MDI and recursive feature elimination (RFE), and two classification models, naïve Bayes and SVM. The metrics we use to evaluate the pre-processing stage are the probability density function (PDF) and the t-test; we use a bag of words (BoW) to evaluate the stop-word removal stage; and sensitivity, specificity, and the receiver operating characteristic (ROC) curve are used to evaluate the feature selection and classification methods. The test results show that TF-IDF produces 1,022 new features. The combination of methods yielded the six models we compared, of which SVM+TF-IDF+MDI performed best, with accuracy and area under the curve (AUC) scores of 0.99 and 0.97, respectively.
- Research Article
22
- 10.1155/2023/6648970
- Sep 20, 2023
- Applied Computational Intelligence and Soft Computing
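Mean decrease in impurity (MDI) feature selection, as used in the study above, amounts to ranking features by a random forest's impurity-based importances; the sketch below uses scikit-learn on synthetic data (the matrix and the choice of k are illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the TF-IDF feature matrix.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# MDI: each feature's importance is the impurity decrease it causes at its
# splits, averaged over all trees (sklearn's feature_importances_).
k = 10
top_k = np.argsort(forest.feature_importances_)[::-1][:k]
```

Keeping only the top-k columns of X by this ranking is the selection step; the reduced matrix then feeds the downstream classifier.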
The use of the short message service (SMS) and e-mail has increased enormously over the last decades. 80% of people do not read e-mails, while 98% of cell phone users read their SMS daily. However, these communication media are unsafe and can carry malicious attacks called spam. E-mails that pretend to be from a trusted company in order to obtain "financial or personal information" are phishing e-mails. These e-mails contain links, and users might download malicious software onto their computers when they click on them. Many techniques and models have been developed to automatically detect such SMS and e-mail spam, but none of them has achieved 100% accuracy. In previous machine learning (ML) studies, spam detection on small datasets has resulted in lower accuracy. To counter this problem, in this paper, multiple ML classifiers and a deep learning (DL) classifier were applied to an SMS and e-mail dataset for spam detection with higher accuracy. After conducting experiments on the real dataset, the researchers concluded that the proposed system performed better and more accurately than previously existing models. Specifically, the support vector machine (SVM) classifier outperformed all others, suggesting that SVM is the optimal choice for this classification task.
- Conference Article
1
- 10.1109/icnlp52887.2021.00007
- Mar 1, 2021
Spam email detection is a research hotspot, and the most efficient detection methods are based on deep learning. In the context of the extensive use of pre-trained word vectors in deep neural networks, this paper studies the impact of pre-trained word vector models on a Text-CNN-based spam classification model, and uses token-granularity matching to optimize the word2vec pre-trained word vector model's representation of spam emails. By comparing the accuracy and time complexity of spam classification with and without token-granularity matching, it can be concluded that word2vec pre-trained word vectors combined with token-granularity processing improve the performance of the Text-CNN model on spam email classification.
- Research Article
- 10.35134/komtekinfo.v12i4.659
- Dec 30, 2025
- Jurnal KomtekInfo
Advances in information technology and the increasing use of social media have significantly influenced the behavior of Generation Z. The generation born between 1997 and 2012 is known to be very familiar with the digital world, but it also faces challenges such as a lack of in-person social interaction and the risk of mental health disorders. This study aims to identify and classify public sentiment towards Generation Z on social media, especially on platform X (formerly Twitter), using the Support Vector Machine (SVM) method. The research was carried out in several stages: collection of 1,607 text samples using crawling techniques; text pre-processing (tokenization, case folding, stop-word removal, stemming, and normalization); and feature extraction using the Term Frequency-Inverse Document Frequency (TF-IDF) method. The processed data were then classified by the SVM into three sentiment categories: positive, negative, and neutral. Evaluation was carried out by measuring accuracy, recall, and F1-score through a confusion matrix. The results showed an accuracy of 85%, a precision of 85%, a recall of 95%, and an F1-score of 90%, indicating that SVM was able to classify sentiment with high accuracy and stability. In addition, SVM proved more effective than the other methods examined in previous studies. The analyzed data show that most sentiment towards Generation Z is negative, reflecting public concern about the behavior and mindset of this generation. This research is expected to serve as a reference for academics, practitioners, and policymakers in understanding public opinion and designing targeted policies for the younger generation. Keywords: Sentiment Analysis, Generation Z, Support Vector Machine, Social Media, Machine Learning.
- Research Article
- 10.29040/ijcis.v6i3.253
- Aug 10, 2025
- International Journal of Computer and Information System (IJCIS)
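The metrics reported in studies like the one above follow directly from confusion-matrix counts; the counts below are hypothetical, chosen only to roughly reproduce the figures quoted in the abstract:

```python
# Hypothetical binary confusion-matrix counts (positive = "negative sentiment"):
# tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
tp, fp, fn, tn = 90, 16, 5, 50

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # fraction of all correct predictions
precision = tp / (tp + fp)                    # how many flagged positives were real
recall    = tp / (tp + fn)                    # how many real positives were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
```

With these counts, accuracy ≈ 0.87, precision ≈ 0.85, recall ≈ 0.95, and F1 ≈ 0.90, illustrating how a high recall and moderate precision combine into the reported F1.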
The rapid spread of disinformation and fabricated news across online platforms poses a critical risk to informed public engagement and the foundations of democratic governance. This study examines how well different machine learning techniques can classify fake news, using textual features extracted through the Term Frequency-Inverse Document Frequency (TF-IDF) method. The analysis covers five commonly used algorithms: Logistic Regression, Support Vector Machine (SVM), Naive Bayes, Random Forest, and XGBoost. A publicly accessible dataset containing annotated real and fake news articles served as the basis for training and testing these models. The dataset underwent extensive preprocessing, including tokenization, stopword removal, and TF-IDF vectorization, resulting in a sparse high-dimensional matrix of 5,068 documents and 39,978 features. Performance evaluation was based on multiple metrics: train/test accuracy, misclassification rate, false positives/negatives, mean cross-validation score, and execution time. Results showed that SVM and Logistic Regression achieved the highest test accuracy (93.61% and 92.27%, respectively) and exhibited robust cross-validation scores, indicating strong generalization ability. In contrast, Naive Bayes produced faster results but suffered from a high false positive rate and lower accuracy (84.77%). Random Forest and XGBoost demonstrated good predictive power but showed signs of overfitting and moderate misclassification rates. These findings suggest that SVM and Logistic Regression are well suited for fake news detection in textual datasets using TF-IDF features. While traditional models remain effective, future work may explore deep learning approaches and context-aware language models to enhance detection accuracy across more complex and multilingual datasets. This study contributes to ongoing efforts to combat misinformation through automated, scalable, and interpretable machine learning techniques.
- Research Article
- 10.1142/s0219467823500341
- Jul 28, 2022
- International Journal of Image and Graphics
Twitter spam has become a significant problem in recent years. Existing work focuses on exploiting machine learning models to detect spam on Twitter by determining statistical features of the tweets. Even though these models achieve good results, it is hard to sustain the performance attained by supervised approaches. This paper introduces a deep learning-assisted spam classification model for Twitter, based on the sentiments and topics modeled in the tweets. The initial step is data collection. Subsequently, the collected data are preprocessed with stop-word removal, stemming, and tokenization. The next step is feature extraction, wherein POS tagging, headwords, rule-based lexicon, word length, and weighted holoentropy features are extracted. Then, the proposed sentiment score extraction is carried out to analyze their variation between non-spam and spam information. Finally, the diffusions of spam data on Twitter are classified into spam and non-spam. For this, an optimized deep ensemble technique is introduced that encloses a neural network (NN), support vector machine (SVM), random forest (RF), and deep neural network (DNN). In particular, the weights of the DNN are optimally tuned by an arithmetic crossover-based cat swarm optimization (AC-CS) model. Finally, the supremacy of the developed approach is examined via evaluation against extant techniques. Accordingly, the proposed AC-CS ensemble model attained a better accuracy value at a learning percentage of 80, which is 18.1%, 14.89%, 11.7%, 12.77%, 10.64%, 6.38%, 6.38%, and 6.38% higher than the SVM, DNN, RNN, DBN, MFO ensemble, WOA ensemble, EHO ensemble, and CSO ensemble models.
- Research Article
- 10.3390/computers15010007
- Dec 23, 2025
- Computers
The expansion of electronic health records (EHRs) has generated a large amount of unstructured textual data, such as clinical notes and medical reports, which contain diagnostic and prognostic information. Effective classification of these textual medical notes is critical for improving clinical decision support and healthcare data management. This study presents a statistically rigorous comparative analysis of four traditional machine learning algorithms—Random Forest, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine—for multiclass classification of medical notes into four disease categories: Neoplasms, Digestive System Diseases, Nervous System Diseases, and Cardiovascular Diseases. A dataset containing 9633 labeled medical notes was preprocessed through text cleaning, lemmatization, stop-word removal, and vectorization using term frequency-inverse document frequency (TF–IDF) representation. The models were trained and optimized through GridSearchCV with 5-fold cross-validation and evaluated across five independent stratified 90-10 train–test splits. Evaluation metrics, including accuracy, precision, recall, F1-score, and multiclass ROC-AUC, were used to assess model performance. Logistic Regression demonstrated the strongest overall performance, achieving an average accuracy of 0.8469 and high macro and weighted F1 scores, followed by Support Vector Machine and Multinomial Naive Bayes. Misclassification patterns revealed substantial lexical overlap between digestive and neurological disease notes, underscoring the limitations of TF–IDF representations in capturing deeper semantic distinctions. These findings confirm that traditional machine learning models remain robust, interpretable, and computationally efficient tools for textual medical note classification, and the study establishes a transparent and reproducible benchmark that provides a solid foundation for future methodological advancements in clinical natural language processing.
- Research Article
- 10.21107/kursor.v13i1.417
- Jul 18, 2025
- Jurnal Ilmiah Kursor
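The training setup described above (a TF-IDF pipeline tuned with GridSearchCV under 5-fold cross-validation) can be sketched with scikit-learn; the toy corpus, the Logistic Regression stand-in, and the parameter grid are illustrative assumptions, not the study's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy two-class corpus standing in for labeled medical notes.
texts = [
    "tumor mass biopsy oncology", "chemotherapy tumor lesion",
    "nausea ulcer gastric pain", "liver gastric reflux nausea",
    "tumor oncology radiation", "ulcer digestion gastric",
    "biopsy lesion oncology mass", "reflux pain digestion liver",
    "radiation chemotherapy mass", "gastric ulcer pain reflux",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]

# TF-IDF vectorization and the classifier tuned together as one pipeline,
# so the vectorizer is refit inside every cross-validation fold.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(texts, labels)
```

Putting the vectorizer inside the pipeline prevents information from the held-out fold leaking into the fitted vocabulary, which is the main reason to tune this way.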
Sentiment analysis plays a crucial role in natural language processing by identifying and categorizing opinions or emotions conveyed in textual data. It is widely applied across diverse fields such as product review analysis, social media monitoring, and market research. To enhance the accuracy and reliability of sentiment classification, various methods and feature extraction techniques have been explored. This study investigates the use of the Support Vector Machine (SVM) for sentiment analysis, comparing three feature extraction techniques: Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and Word2Vec. Our findings indicate that SVM performs effectively with all three feature extraction methods, with TF-IDF yielding the highest accuracy at 0.79. Although the BoW method showed competitive results, it slightly trailed TF-IDF in k-fold validation. Word2Vec, however, exhibited the lowest performance, achieving a maximum accuracy of 0.69. A comparative analysis of accuracy, precision, recall, and F1-score highlights the superiority of TF-IDF in delivering consistent and accurate results. Further statistical analysis using ANOVA revealed no significant differences between the models across any of the evaluation metrics. Additionally, the evaluation was conducted under several scenarios, including tests on balanced and imbalanced datasets, varying dataset sizes, and different values of the SVM C parameter. These scenarios provided deeper insights into the factors influencing the system's performance, reinforcing that TF-IDF combined with SVM remains the most effective approach in this study.
- Research Article
- 10.37034/infeb.v7i3.1266
- Sep 30, 2025
- Jurnal Informatika Ekonomi Bisnis
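The BoW and TF-IDF extractors compared in the study above differ only in how term occurrences are weighted; a scikit-learn sketch on an illustrative corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative review snippets standing in for the studies' corpora.
reviews = [
    "great product works great",
    "bad product stopped working",
    "works fine great value",
]

# BoW: raw term counts per document.
bow = CountVectorizer()
X_bow = bow.fit_transform(reviews)

# TF-IDF: the same counts, rescaled so terms common across the
# whole corpus contribute less to each document vector.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(reviews)
```

Both produce sparse matrices over the same vocabulary; only the cell values differ, which is why the two often perform similarly with an SVM, as the study observed.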
The digitalization of public services has encouraged the development of the Jamsostek Mobile (JMO) application by BPJS Ketenagakerjaan. This application is expected to provide convenience in accessing information, JHT claims, and other services. However, user reviews on the Google Play Store show diverse perceptions, ranging from satisfaction to technical complaints. This study aims to conduct sentiment analysis on user reviews of the JMO application by classifying opinions into positive, negative, and neutral sentiments. Data were collected through crawling from the Google Play Store and processed using text preprocessing stages, including data cleaning, case folding, stopword removal, tokenization, stemming, and Term Frequency–Inverse Document Frequency (TF-IDF) weighting. The classification process was then carried out using three machine learning algorithms, namely Support Vector Machine (SVM), Random Forest, and Logistic Regression. The results indicate that negative sentiment dominates with 46%, followed by positive sentiment at 40% and neutral at 14%. Most complaints are related to login difficulties, application errors, and technical bugs in claim features. In terms of algorithm performance, SVM with a linear kernel achieved the highest accuracy of 87.5% and an F1-score of 0.87, outperforming Random Forest (85.3%) and Logistic Regression (82.7%). Academically, this study reinforces the effectiveness of SVM in sentiment analysis using TF-IDF, while practically providing recommendations for BPJS Ketenagakerjaan to improve system stability, login speed, and reduce application bugs to enhance user satisfaction.