Conventional police databases contain much information on cybercrime, but extracting it remains a practical challenge. This is because these databases rarely contain labels that could be used to automatically retrieve all cybercrime incidents. In this article, we present a supervised machine learning method for extracting cybercrime incidents in calls for police service datasets. Data from the Korean National Police (2020, 9 months, N = 15 million call logs) is used for the demonstration. We combined methods of keyword query selection, minority oversampling, and majority voting techniques to develop a classifier. Three classification techniques, including Naïve Bayes, linear SVM, and kernel SVM, were tested, and the kernel model was chosen to build the final model (accuracy, 93.4%; F1-score, 92.4). We estimate that cybercrime only represents 4.6% of the cases in the selected dataset (excluding traffic-related incidents), but that it can be prevalent with some crime types. We found, for example, that about three quarters (76%) of all fraud incidents have a cyber dimension. We conclude that the cybercrime classification method proposed in this study can support further research on cybercrime and that it offers considerable advantages over manual or keyword-based approaches.
Read full abstract