The growing number of payment methods also makes it easier for criminals to steal personal information, take over payment accounts, and steal money. With a vast number of bank card transactions occurring every day, Credit Card Fraud Detection (CCFD) is a serious challenge, so this paper predicts whether or not fraud occurs using six types of machine learning models. For Problem 1, first, the mean, maximum, minimum, median, variance, standard deviation, and quartiles are calculated for each indicator. Second, data cleaning is carried out, and the data set is found to be free of missing values and outliers. The data are then preprocessed with min-max normalisation and z-score standardisation. Next, correlation analysis is carried out: according to the characteristics of the indicators themselves, the first four are classified as negative indicators and the last three as positive indicators, and Pearson correlation coefficients are calculated after each of the two processing steps. The coefficient of variation method is then used to calculate the weights of the seven indicators that influence whether fraud occurs. Finally, a BP neural network model, a decision tree model, a random forest classification model, an ELM model, an SVM model, and a logistic regression model are established. For Problem 2, four of the models constructed in Problem 1 are solved. For the BP neural network model, the data set is divided into a training set and a testing set in a 6:4 ratio, and the sigmoid function is used as the activation function. An output >0.5 is recorded as 1, i.e. fraudulent behaviour; an output <0.5 is recorded as 0, i.e. non-fraudulent behaviour. Adjusting the learning rate and the number of iterations yields a smaller average mean square error after gradient descent.
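The preprocessing and weighting steps named for Problem 1 can be illustrated with a minimal sketch. It assumes population standard deviation, equal treatment of positive and negative indicators, and two made-up toy columns (`amount`, `distance`) that are not taken from the paper's data set; the function names are hypothetical.

```python
from statistics import mean, pstdev

def min_max(column):
    """Min-max normalisation: rescale values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def z_score(column):
    """Z-score standardisation: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

def cv_weights(columns):
    """Coefficient-of-variation weights.

    Each indicator's CV is std / |mean|; weights are the CVs
    normalised so that they sum to 1.
    """
    cvs = [pstdev(c) / abs(mean(c)) for c in columns]
    total = sum(cvs)
    return [cv / total for cv in cvs]

# Toy indicator columns (illustrative values only, not from the paper).
amount = [10.0, 20.0, 30.0, 40.0]
distance = [1.0, 1.0, 3.0, 3.0]

normalised = min_max(amount)
standardised = z_score(amount)
weights = cv_weights([amount, distance])
print(normalised, standardised, weights)
```

Indicators with larger relative dispersion receive larger weights under this scheme; how the paper orients negative indicators before weighting is not stated in the abstract and is not modelled here.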
To solve the SVM model, the data set is divided into ten groups using an improved ten-fold cross-validation, with one group as the training set and nine groups as the validation set, to obtain the model with the highest accuracy and the corresponding training data; on this basis, a genetic algorithm is then used to optimise the kernel parameters of the SVM model. To solve the decision tree model, the data are divided into a training set and a prediction set in a 7:3 ratio, and the number of leaf nodes is optimised. To solve the random forest classification model, the data are likewise divided into a training set and a prediction set in a 7:3 ratio; among classifiers with similar accuracy, the one with fewer decision trees is chosen.
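Two of the selection rules above can be sketched in a few lines. The sketch assumes an interleaved assignment of samples to folds, a hypothetical accuracy tolerance of 0.005 for "similar accuracy", and made-up candidate scores; none of these come from the paper. The split follows the abstract's description (one group trains, nine validate), which reverses the roles used in conventional k-fold cross-validation.

```python
def ten_fold_splits(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs over k folds.

    As described in the abstract, one fold is used for training
    and the remaining k - 1 folds for validation.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, train in enumerate(folds):
        val = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

def smallest_forest(results, tolerance=0.005):
    """Pick the forest with the fewest trees among near-best candidates.

    results: list of (n_trees, accuracy) pairs.
    tolerance: assumed threshold for treating accuracies as similar.
    """
    best = max(acc for _, acc in results)
    near_best = [(n, acc) for n, acc in results if best - acc <= tolerance]
    return min(near_best, key=lambda pair: pair[0])[0]

# Illustrative scores only (not results from the paper).
candidates = [(50, 0.951), (100, 0.958), (200, 0.960), (400, 0.961)]
chosen = smallest_forest(candidates)
splits = list(ten_fold_splits(20))
print(chosen, len(splits))
```

Preferring the smaller forest trades a negligible loss in accuracy for lower training and prediction cost.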