An improved genetic algorithm for feature selection in the classification of Disaster-related Twitter messages

Ian P Benitez, Ariel M Sison, Ruji P Medina

doi:10.1109/iscaie.2018.8405477

Abstract

In text classification with machine learning, using terms as features in a vector space representation can result in a high-dimensional feature space. This condition introduces problems including high computational cost in data analysis and degradation of classification accuracy. This study improved classifier performance in the classification of natural crisis-related Twitter messages by reducing feature space dimensionality through feature selection with a Genetic Algorithm (GA). A limitation of GA in text feature selection is premature convergence due to a lack of population diversity in subsequent generations. To address this, the GA's crossover operator was enhanced by: a) setting a variable slice-point on the size of genes to be swapped for every offspring creation, and b) using features' frequency scores to decide the swapping of genes. The enhanced algorithm was tested on several Twitter datasets and compared with two standard GA implementations, one using single-point crossover and one using multi-point crossover. Experimental results showed the superiority of the enhanced GA in reducing the number of selected features and in improving classification accuracy using Multinomial Naive Bayes.
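The two crossover enhancements named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rules, so everything below is an assumption made for illustration: the function name `frequency_guided_crossover`, the binary-mask chromosome encoding, the uniform draw of the slice size and position, and the use of the mean frequency score as the threshold for accepting a swap are all hypothetical and are not taken from the paper.

```python
import random


def frequency_guided_crossover(parent_a, parent_b, freq_scores, rng=None):
    """Create one offspring from two binary feature masks.

    Illustrative only: the slice rule and the swap rule are assumptions,
    not the authors' published procedure.
    """
    rng = rng or random.Random()
    n = len(parent_a)
    if n == 0 or len(parent_b) != n or len(freq_scores) != n:
        raise ValueError("parents and freq_scores must share a non-zero length")

    # (a) Variable slice: a new slice size and start are drawn per offspring.
    slice_size = rng.randint(1, n)
    start = rng.randint(0, n - slice_size)

    # Assumed threshold: a feature counts as "frequent" at or above the mean.
    mean_score = sum(freq_scores) / n

    child = list(parent_a)
    for i in range(start, start + slice_size):
        if parent_a[i] == parent_b[i]:
            continue  # nothing to swap at this position
        frequent = freq_scores[i] >= mean_score
        # (b) Assumed swap rule: take parent_b's gene only when doing so
        # turns a frequent feature on or an infrequent feature off.
        if (parent_b[i] == 1) == frequent:
            child[i] = parent_b[i]
    return child
```

A call such as `frequency_guided_crossover([1, 0, 1, 0], [0, 1, 0, 1], [5.0, 1.0, 0.5, 4.0], random.Random(7))` returns a mask of the same length whose genes outside the drawn slice are copied from the first parent. Because the slice size changes with every call, repeated crossovers of the same parents can yield different offspring, which is the kind of diversity the abstract says the enhancement targets.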

Full Text