The k-Nearest Neighbor (k-NN) algorithm is a well-known supervised learning method, and the choice of distance metric has a strong effect on its performance. By Ostrowski's theorem, the only nontrivial absolute values on the field of rational numbers, Q, are the usual absolute value and the p-adic absolute value for a prime p. Motivated by this theorem, the p-adic absolute value is used to compute the p-adic distance between two samples in the k-NN algorithm. In this study, the p-adic distance on Q was coupled with k-NN and applied to 10 well-known public datasets containing categorical, numerical, and mixed (both categorical and numerical) predictive attributes, and its performance was compared with the Euclidean, Manhattan, Chebyshev, and Cosine distances. The average accuracy obtained with the p-adic distance ranked first on 5 of the 10 datasets; in particular, on mixed datasets the p-adic distance outperformed the other distances. For r = 1, 2, 3, the effect of truncating numerical values to r decimal digits in the p-adic calculation was examined on the numerical and mixed datasets. In addition, the parameter p of the p-adic distance was tested with prime numbers less than 29, and the average accuracies obtained for the different values of p were very close to one another, especially on the categorical and mixed datasets. Finally, the results suggest that k-NN with the p-adic distance may be better suited to binary classification than to multi-class classification.
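The p-adic distance the abstract refers to is standard: for a prime p, the p-adic valuation v_p(x) of a nonzero rational x is the exponent of p in its factorization, the p-adic absolute value is |x|_p = p^(-v_p(x)) with |0|_p = 0, and the distance between two samples is d_p(x, y) = |x - y|_p. A minimal sketch of this distance on Q (function names are illustrative, not the authors' implementation) might look like:

```python
from fractions import Fraction


def p_adic_valuation(n: int, p: int) -> int:
    """Largest exponent e such that p**e divides the nonzero integer n."""
    e = 0
    n = abs(n)
    while n % p == 0:
        n //= p
        e += 1
    return e


def p_adic_abs(x: Fraction, p: int) -> Fraction:
    """p-adic absolute value |x|_p = p**(-v_p(x)), with |0|_p = 0."""
    if x == 0:
        return Fraction(0)
    # v_p(a/b) = v_p(a) - v_p(b)
    v = p_adic_valuation(x.numerator, p) - p_adic_valuation(x.denominator, p)
    return Fraction(1, p**v) if v >= 0 else Fraction(p ** (-v))


def p_adic_distance(x, y, p: int) -> Fraction:
    """p-adic distance d_p(x, y) = |x - y|_p between two rationals."""
    return p_adic_abs(Fraction(x) - Fraction(y), p)
```

Note that p-adic "closeness" differs sharply from Euclidean closeness: for p = 2, d_2(12, 4) = |8|_2 = 1/8, which is small because 8 is divisible by a high power of 2, while d_2(1, 0) = 1. In the k-NN setting, this distance (or a sum of per-attribute p-adic distances) would simply replace the Euclidean metric when ranking neighbors.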