Abstract

The k-nearest neighbors (kNN) classification algorithm is one of the most widely used non-parametric classification methods; however, it is limited by memory consumption tied to the size of the dataset, which makes it impractical to apply to large volumes of data. Variations of this method have been proposed, such as condensed kNN, which divides the training dataset into clusters to be classified; other variations reduce the input dataset before applying the algorithm. This paper presents a structure-less variation of the kNN algorithm designed to work with categorical data. Categorical data, by its nature, can be compressed in order to decrease the memory required when executing the classification. The method adds a prior compression phase and then applies the algorithm directly on the compressed data. This allows the whole dataset to be kept in memory, which leads to a considerable reduction in the amount of memory required. Experiments and tests carried out on known datasets show the reduction in the volume of information stored in memory while maintaining the classification accuracy. They also show a slight decrease in processing time, since the information is decompressed in real time (on the fly) while the algorithm is running.
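
As a rough illustration of the core idea (a sketch under assumptions, not the paper's exact scheme), the following Python fragment bit-packs low-cardinality categorical attributes into 64-bit words and decodes each attribute on the fly while computing a mismatch-count (Hamming-style) distance; the bit width, the packing layout, and all names are hypothetical.

    import numpy as np

    BITS = 4                 # assumption: every attribute has at most 2**4 = 16 categories
    PER_WORD = 64 // BITS    # 16 packed attributes per 64-bit word
    MASK = np.uint64((1 << BITS) - 1)

    def pack(rows):
        # Pack an (n, d) array of small integer category codes into uint64 words.
        n, d = rows.shape
        words = np.zeros((n, -(-d // PER_WORD)), dtype=np.uint64)
        for j in range(d):
            shift = np.uint64(BITS * (j % PER_WORD))
            words[:, j // PER_WORD] |= rows[:, j].astype(np.uint64) << shift
        return words

    def unpack_attr(words, j):
        # Decode attribute j on the fly; the full rows are never materialized.
        shift = np.uint64(BITS * (j % PER_WORD))
        return (words[:, j // PER_WORD] >> shift) & MASK

    def knn_predict(train_words, train_y, query_words, d, k=5):
        # query_words: a single packed query of shape (1, n_words).
        # Distance suited to categorical data: count mismatching attributes.
        dist = np.zeros(train_words.shape[0], dtype=np.int64)
        for j in range(d):
            dist += unpack_attr(train_words, j) != unpack_attr(query_words, j)
        nearest = np.argsort(dist)[:k]
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        return labels[np.argmax(counts)]   # majority vote among the k neighbors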

Highlights

  • Discrete data compression is an interesting problem, especially when the compressed data must preserve the characteristics of the original data [1].

  • The paper describes, in general, the process of data classification, focusing on the k-nearest neighbors (kNN) method.

  • The algorithm used is kNN with values of k from 5 to 20 (k = 5, 10, 15, 20), using by default the Euclidean metric (EuclideanDistance), which operates on real numbers; a sketch of this setup follows the list.
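
A minimal sketch of the experimental setup summarized in the last highlight, using scikit-learn as a stand-in implementation (the paper's own code is not shown here) and a placeholder dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)   # placeholder, not one of the paper's benchmarks

    for k in (5, 10, 15, 20):
        # metric="euclidean" mirrors the default EuclideanDistance mentioned above,
        # which operates on real-valued (numerically encoded) attributes.
        clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        print(f"k={k:2d}  mean accuracy={cross_val_score(clf, X, y, cv=5).mean():.3f}")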


Introduction

Discrete data compression is an interesting problem, especially when the compressed data must preserve the characteristics of the original data [1]. In many such datasets the number of attributes (called the dimension) is large, and many algorithms do not work well with high-dimensional datasets because they require all the information to be stored in memory prior to processing. The predominant characteristic of this type of information is that most of the data is categorical. This section describes, in general, the process of data classification, focusing on the kNN method (the algorithm is presented below). The kNN algorithm belongs to the family of methods known as instance-based methods. These methods are based on the principle that observations (instances) within a dataset usually lie close to other observations that have similar attributes [11]. There are two types of kNN algorithms [10]:

  • Structure-less kNN, which keeps the full training set in memory and scans all instances for each query (the type used in this paper).

  • Structure-based kNN, which builds a data structure (such as a tree) over the training set to speed up the neighbor search.
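
A minimal sketch of the structure-less variant, which is the one this paper builds on: every training instance is kept in memory and scanned for each query (the names and the Euclidean metric here are illustrative).

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, y_train, x, k=5):
        # Distance from the query to every stored instance (no index structure).
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]              # indices of the k closest instances
        return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

A structure-based variant would instead index X_train with, for example, a k-d tree so that each query avoids the full scan.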
