ABSTRACT Given the ubiquity of data, purchase and churn prediction have become crucial for data-driven decision-making. While many sectors have benefited from the abundance of data in the wake of digital transformation, applications of predictive analytics remain limited in small or medium-sized enterprises and in journalism. This study therefore applies such techniques in the context of start-up journalism. To this end, article purchase and subscription data for over 2,700 individual customers of a digital journalistic platform were analysed in multiple models using logistic regression and the ensemble methods of random forest and gradient boosting. The findings suggest, first, that variables typically associated with the recency, frequency and monetary value of customers are central to purchase prediction and, second, that churn prediction depends on other behavioural variables, such as time/actions per visit. Furthermore, while a naïve application of logistic regression achieves results comparable to the ensemble methods in purchase prediction, it strongly underperforms in churn prediction, whereas gradient boosting performs best throughout. When the models are optimised for recall, however, random forest performs best. The results demonstrate high accuracy scores and hence imply the applicability of such models to journalism. Finally, this paper demonstrates how to reduce classification thresholds in models to improve sensitivity for small, imbalanced data sets.
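The effect of reducing a classification threshold on sensitivity (recall) can be illustrated with a minimal sketch. It is not the study's implementation: the function name `recall_at_threshold`, the toy labels and the predicted probabilities are all hypothetical, chosen only to show why a threshold below the conventional 0.5 can recover more of the minority class in an imbalanced data set.

```python
def recall_at_threshold(y_true, y_prob, threshold=0.5):
    """Recall (sensitivity) when every case with probability >= threshold is labelled positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    positives = true_pos + false_neg
    return true_pos / positives if positives else 0.0


# Hypothetical toy data (not from the study): 1 = churned, 0 = retained.
# The positive class is the minority, as in an imbalanced data set.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.33, 0.45, 0.35, 0.40, 0.60]

# Conventional threshold: only one of the three churners is detected.
recall_default = recall_at_threshold(y_true, y_prob, threshold=0.5)

# Lowered threshold: all three churners are detected, at the cost of
# also flagging two retained customers (0.33 and 0.45) as churners.
recall_lowered = recall_at_threshold(y_true, y_prob, threshold=0.3)

print(f"recall at 0.5: {recall_default:.2f}")
print(f"recall at 0.3: {recall_lowered:.2f}")
```

In this toy case the gain in recall comes with additional false positives, so any lowered threshold would need to be weighed against the cost of wrongly flagged customers.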