Abstract

The main aim of the study was to explore the feature selection process of online web data prior to unsupervised machine learning models. At the time of writing, no such literature could be found reporting the use of feature selection in this context. Feature selection was determined by inspecting the variability and association between features. The variability of numeric features were quantified using the variance, mean absolute difference and dispersion ratio metrics whilst the coefficient of unalikeability was employed for categorical features. To quantify association, correlation matrices were used for numeric features, chi-squared independence tests between categorical features and box-and-whisker plots between mixed features. The main findings showed the variance, mean absolute difference, dispersion ratio and coefficient of unalikeability metrics have successfully highlighted features with very low variability within the observed data. Whilst the correlation matrix, chi-squared test for independence and box-and-whisker plots highlighted possible redundancy, natural relationships and insightful relationships between the features thereby suggesting features to be considered for omission prior to unsupervised modelling. The proposed methods and findings can be applied to various other applications of feature selection and exploration.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call