Abstract

Email classification using Machine Learning (ML) algorithms is an application of Artificial Intelligence (AI) in which the system learns to distinguish ham from spam mail. These algorithms predict a class from probabilities estimated over the features associated with each class label. Classifier performance degrades when there are many features with complex distributions and only a limited number of data instances. This paper illustrates the need for a suitable pre-processing stage that eliminates irrelevant and redundant features before classification in order to enhance classifier performance, facilitate data visualization, and reduce storage requirements and modelling time. The proposed methodology investigates and evaluates several feature selection and ranking algorithms in the email classification process. The approaches used are classifier-based feature selection methods, scheme-dependent and scheme-independent pruning methods, ranking-based fast feature selection methods, a cost-sensitive classifier, and a cost-sensitive learner. A quantitative analysis was conducted using different feature selection and pruning algorithms. The main effects observed are a reduced feature space and a more manageable computational load for the learning algorithm. The experimental results show an accuracy improvement of at least 10%, along with better precision values. The paper concludes with significant outcomes, including the validation of various ranker evaluators, the computational advantages of scheme-independent methods, and the effects of the cost-sensitive classifier and cost-sensitive learner. Our experiments also illustrate the importance of classifier dependency in determining the sensitivity of feature selection algorithms, and the peril of over-fitting the model in the email domain.
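
As a rough illustration of the ranking-based feature selection referred to above, the following minimal Python sketch scores each feature by information gain with respect to the ham/spam label and keeps the top-ranked ones. The toy data, the feature names, the helper functions, and the choice of information gain as the ranking criterion are all assumptions made for illustration; they are not the evaluators or the dataset used in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, feature_names):
    """Return (name, gain) pairs sorted from most to least informative."""
    scores = []
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        scores.append((name, information_gain(column, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data invented for illustration: binary word-presence features.
FEATURES = ["contains_free", "contains_meeting", "has_greeting"]
ROWS = [
    (1, 0, 1),
    (1, 0, 0),
    (1, 0, 1),
    (0, 1, 0),
    (0, 1, 1),
    (0, 0, 0),
]
LABELS = ["spam", "spam", "spam", "ham", "ham", "ham"]

ranking = rank_features(ROWS, LABELS, FEATURES)

# Keep only the two highest-ranked features; the cut-off is arbitrary here.
top_k = [name for name, _ in ranking[:2]]
```

In this toy example the feature that perfectly separates the two classes is ranked first, while the nearly uninformative one is ranked last and pruned. A ranker of this kind is independent of any particular classifier, which is what makes it cheap compared with classifier-based (wrapper) selection.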
