Abstract
In this project, we focus on electronic mail, one of the most important means of communication among information professionals. As its use and significance among the general populace grows, so does its importance and utility. It has allowed for more adaptability and convenience in communication, both in the private and professional spheres. The increased use of email has led to a rise in spam as well as legitimate messages. An email that is sent to a large number of people without the sender’s knowledge or consent is considered spam. Millions of internet users, both casual and professional, are currently frustrated by the widespread problem of email spam. The purpose of this study is to provide a hybrid approach to machine learning for identifying spam in email. Bagging and boosting of machine learning-based multinomial Decision Tree, Naive Bayes, KNN, Random Forest, and the SVM method are the proposed hybrid techniques. The bagging method uses a concurrent combination of weak classifiers to boost classification accuracy. The standard deviation of misclassifications is decreased by using bagging. Alternatively, by linking the classifiers in a series fashion, the boosting strategy can construct a robust classifier out of two or more relatively weak classifiers. Improved classification results can be achieved through reduced bias and variance thanks to the use of boosting. In order to detect spam in emails, it is necessary to take into account datasets, pre-process those datasets, extract and pick features, and classify the data. In this study, we evaluate the feasibility of conducting experiments using data from the Ling-Spam Corpus and the CSDMC2010 Spam Corpus. According to the stop-word list and lemmatiser, Ling-Spam Corpus’s dataset is split into four different directories: bare, lemm, lemm stop, and stop. In addition, pre-processing consists of converting strings to word vectors (tokenization), stemming words, and removing stop words. Since the Ling Spam Corpus is already organised according to the stop-word list and the lemmatiser, only the CSDMC2010 Spam Corpus undergoes the stemming and stop words removal processes. Features are extracted and selected from the preprocessed data. The feature selection procedure in this work makes use of a correlation-based approach.
Published Version
Join us for a 30 min session where you can share your feedback and ask us any queries you have