Abstract

Nearest neighbor algorithms such as k-nearest neighbors (kNN) are fundamental supervised learning techniques that classify a query instance based on the class labels of its neighbors. However, large datasets are often not fully labeled, and the unknown probability distribution of the instances may be uneven. Moreover, kNN suffers from challenges such as the curse of dimensionality, the choice of the optimal number of neighbors, and poor scalability on high-dimensional data. To overcome these challenges, we propose an improved approach to classification via a depth representation of subspace clusters formed from high-dimensional data. We offer a consistent and principled approach to dynamically choose the nearest neighbors for classifying a query point by i) identifying the structure and distribution of the data; ii) extracting relevant features; and iii) deriving an optimal value of k, depending on the structure of the data, by representing the data with a data depth function. The proposed classification algorithm, built on this depth-based representation of clusters, improves performance in terms of both execution time and accuracy. Experiments on real-world datasets reveal that the proposed approach is at least two orders of magnitude faster on high-dimensional datasets and is at least as accurate as traditional kNN.
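The abstract does not spell out which depth function or clustering scheme the method uses. As an illustration only, the sketch below assumes Mahalanobis depth as the depth function and classifies a query by the label of the cluster in which it attains the greatest depth; the function names, the regularization term, and the one-label-per-cluster assumption are ours, not the paper's.

```python
import numpy as np

def mahalanobis_depth(x, cluster):
    """Depth of point x relative to a cluster: 1 / (1 + squared Mahalanobis distance).
    Larger values mean x lies deeper inside the cluster."""
    mu = cluster.mean(axis=0)
    # Regularize the covariance so it stays invertible for small or flat clusters (assumption)
    cov = np.cov(cluster, rowvar=False) + 1e-6 * np.eye(cluster.shape[1])
    diff = x - mu
    d2 = diff @ np.linalg.inv(cov) @ diff
    return 1.0 / (1.0 + d2)

def classify_by_depth(x, clusters, cluster_labels):
    """Assign x the label of the cluster in which it is deepest (illustrative rule only)."""
    depths = [mahalanobis_depth(x, c) for c in clusters]
    return cluster_labels[int(np.argmax(depths))]
```

Because each cluster is summarized by its depth contours rather than by every member point, a query only needs to be compared against a handful of cluster summaries instead of the full training set, which is where the claimed speed-up would come from.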

Highlights

  • The k-nearest neighbors algorithm is a simple nonparametric classification technique which is efficient, provided it is given a good distance metric and has enough labeled training data [1]

  • We offer a consistent and principled approach to dynamically choose the nearest neighbors for classifying a query point by i) identifying the structure and distribution of the data; ii) extracting relevant features; and iii) deriving an optimal value of k, depending on the structure of the data, by representing the data with a data depth function.

  • K is the total number of clusters, t_j^(A_i) is an A_i-dimensional instance assigned to cluster S_i, m_i is the medoid of cluster S_i, and a_(u,i) denotes the value of the u-th attribute of m_i (illustrated in the sketch after this list).
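The notation in the last highlight is given without its surrounding equations. The following sketch, assuming plain Euclidean distance within a cluster, shows one way the medoid m_i of a cluster S_i could be obtained, after which a_(u,i) is simply the u-th component of m_i; the helper name cluster_medoid is hypothetical.

```python
import numpy as np

def cluster_medoid(S_i):
    """Medoid m_i of cluster S_i: the member instance with the smallest total
    Euclidean distance to all other members (distance choice is an assumption)."""
    # Pairwise distance matrix between all instances in the cluster
    dists = np.linalg.norm(S_i[:, None, :] - S_i[None, :, :], axis=2)
    m_i = S_i[np.argmin(dists.sum(axis=1))]
    # a_(u,i), the value of the u-th attribute of m_i, is then simply m_i[u]
    return m_i
```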


Summary

Introduction

The k-nearest neighbors (kNN) algorithm is a simple nonparametric classification technique that is effective, provided it is given a good distance metric and enough labeled training data [1]. If we have no knowledge of the underlying distribution, the decision to classify a data point x into a class y depends on a set of known samples (x1, y1), ..., (xn, yn). The available data points may not be fully labelled. Existing techniques for classifying such datasets are either not scalable or not accurate, because they fail to identify the intricate relationships among the features of the data. Traditional algorithms suited to low-dimensional data therefore need to be improved to handle these scenarios.
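For reference, a minimal sketch of the classical kNN decision rule described above, using Euclidean distance and majority voting over the k closest labeled samples; the helper name knn_classify and the default k are ours, not the paper's.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest labeled samples."""
    # Euclidean distance from the query to every labeled training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority class label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Note that this baseline compares the query against every training point and uses a fixed k, which is exactly the cost and tuning burden the depth-based approach in this paper aims to avoid.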

