Customer churn is a common problem in many industries, including telecommunications, and this has driven the development of advanced techniques for predicting and preventing it. The availability of stored customer data in the form of big data, together with advanced and tuned machine learning (ML) algorithms, has paved the way for extracting useful features of customer behaviour and, consequently, for predicting customer churn. An effective way to further improve the churn prediction capability of different ML algorithms is to employ topological data analysis (TDA). TDA is a framework that applies topological methods to uncover hidden structural features in complex, high-dimensional data. Here, a TDA summary of the 0- and 1-dimensional holes of the data, called barcode statistics, was extracted and used as an additional feature alongside the preprocessed customer data. To address issues such as the effective preprocessing and analysis of large customer datasets and the effective tuning of ML hyperparameters, we implemented an advanced data preprocessing technique consisting of stages such as handling of missing data, feature engineering, encoding of categorical features using the hashing encoding method, and feature selection. Without barcode statistics in the model, the XGBoost algorithm with tuned hyperparameters achieved the best results, with an accuracy of 92.71%, precision of 85.95%, recall of 92.71%, and F-measure of 89.20%. With barcode statistics included as an additional feature, the XGBoost algorithm with tuned hyperparameters again achieved the best results, now much improved, with an accuracy of 98.50%, precision of 98.50%, recall of 98.50%, and F-measure of 98.50%. The use of TDA barcode statistics significantly improved the churn prediction capability of the ML algorithms.
In addition, hyperparameter tuning is not needed when an effective data preprocessing technique is used, or when barcode statistics are used. The best accuracy of 98.5% from this work was in line with the best accuracy of 98.7% from a related work; interestingly, the best precision of 98.5% from this work was superior to the 94.3% precision reported by that related work, despite its higher accuracy.
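To make the idea of barcode statistics concrete, the following is a minimal sketch of how a summary of 0-dimensional holes (connected components) could be computed from a small point cloud. It is an illustration under stated assumptions, not the pipeline used in this work: the abstract does not say which filtration or which statistics were used, so the Vietoris-Rips convention (a component dies at the pairwise distance at which it merges) and the choice of count, mean, standard deviation, and maximum are assumptions, and the function names `h0_barcode_deaths` and `barcode_statistics` are hypothetical. The 1-dimensional holes require a persistent homology library and are not covered here.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def h0_barcode_deaths(points):
    """Finite death times of the 0-dimensional barcode of a point cloud.

    Assumption: Vietoris-Rips convention, where the filtration scale is the
    pairwise Euclidean distance. Every point is born at scale 0, and one
    component dies each time two components merge, so the finite deaths are
    the edge lengths chosen by Kruskal's minimum-spanning-tree algorithm.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge: one component dies here
            deaths.append(length)
    return deaths


def barcode_statistics(deaths):
    """Summarise bar lengths; for H0 the birth is 0, so length equals death.

    The particular statistics chosen here are an assumption for illustration.
    """
    if not deaths:
        return {"count": 0, "mean": 0.0, "std": 0.0, "max": 0.0}
    return {
        "count": len(deaths),
        "mean": mean(deaths),
        "std": pstdev(deaths),
        "max": max(deaths),
    }
```

Calling `barcode_statistics(h0_barcode_deaths(points))` on a list of coordinate tuples returns a small dictionary that could be attached to the preprocessed features. How such a summary is attached to individual customer records is not specified in the abstract and is left open here.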