K-NN’S NEAREST NEIGHBORS METHOD FOR CLASSIFYING TEXT DOCUMENTS BY THEIR TOPICS

N I Boyko,V Yu Mykhailyshyn

doi:10.15588/1607-3274-2023-3-9

Abstract

Context. Optimization of the method of nearest neighbors k-NN for the classification of text documents by their topics and experimentally solving the problem based on the method. Objective. The study aims to study the method of nearest neighbors k-NN for classifying text documents by their topics. The task of the study is to classify text documents by their topics based on a dataset for the optimal time and with high accuracy. Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The k-NN algorithm stores all existing data and categorizes the new point based on the distance between the new point and all points in the training set. For this, a certain distance metric, such as Euclidean distance, is used. In the learning process, k-NN stores all the data from the training set, so it belongs to the “lazy” algorithms since learning takes place at the time of classification. The algorithm makes no assumptions about the distribution of data and it is nonparametric. The task of the k-NN algorithm is to assign a certain category to the test document x based on the categories k of the nearest neighbors from the training dataset. The similarity between the test document x and each of the closest neighbors is scored by the category to which the neighbor belongs. If several of k’s closest neighbors belong to the same category, then the similarity score of that category for the test document x is calculated as the sum of the category scores for each of these closest neighbors. After that, the categories are ranked by score, and the test document is assigned to the category with the highest score. Results. The k-NN method for classifying text documents has been successfully implemented. Experiments have been conducted with various methods that affect the efficiency of k-NN, such as the choice of algorithm and metrics. The results of the experiments showed that the use of certain methods can improve the accuracy of classification and the efficiency of the model. Conclusions. Displaying the results on different metrics and algorithms showed that choosing a particular algorithm and metric can have a significant impact on the accuracy of predictions. The application of the ball tree algorithm, as well as the use of different metrics, such as Manhattan or Euclidean distance, can lead to improved results. Using clustering before applying k-NN has been shown to have a positive effect on results and allows for better grouping of data and reduces the impact of noise or misclassified points, which leads to improved accuracy and class distribution.

Full Text