Abstract

The k-nearest neighbor (kNN) rule is one of the best-known distance-based classifiers and is usually associated with high performance and versatility, as it requires only the definition of a dissimilarity measure. Nevertheless, kNN also suffers from low efficiency since, for each new query, the algorithm must carry out an exhaustive search of the training data; this drawback becomes far more relevant when considering complex structural representations, such as graphs, trees or strings, owing to the cost of the corresponding dissimilarity metrics. This issue has generally been tackled through the use of data reduction (DR) techniques, which reduce the size of the reference set, but the complexity of structural data has historically limited their application in the aforementioned scenarios. A DR algorithm known as reduction through homogeneous clusters (RHC) has recently been adapted to string representations but, since obtaining the exact median of a set of strings is known to be computationally difficult, its authors resorted to computing the set-median. Under the premise that a more exact median value may be beneficial in this context, we present a new adaptation of the RHC algorithm for string data in which an approximate median computation is carried out. The results obtained show significant improvements over the set-median version of the algorithm in terms of both classification performance and reduction rates.
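
To make the set-median/approximate-median distinction concrete, the sketch below (our own illustration, not the authors' implementation) computes the set-median of a collection of strings, i.e., the element of the set that minimizes the sum of edit distances to the remaining elements. The variant studied in this paper would instead derive an approximate median that need not belong to the set; the levenshtein helper is an assumed choice of string dissimilarity.

```python
# Minimal sketch: set-median of a collection of strings under the edit distance.
# The paper's RHC variant replaces this with an approximate median computation.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def set_median(strings):
    """Return the member of the set with the smallest total distance to the rest."""
    return min(strings,
               key=lambda s: sum(levenshtein(s, t) for t in strings))
```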

Highlights

  • In the pattern recognition (PR) field, the objective of supervised classification algorithms is to label unknown prototypes according to a finite set of categories by considering the knowledge automatically gathered from a reference corpus of labeled data

  • Since the corpora considered are composed of string elements, which are a type of structural representation, the distance-based algorithms exhibit extremely poor efficiency

  • While the k-nearest neighbor (kNN) rule constitutes one of the best-known distance-based classifiers, it is generally associated with low efficiency when tackling scenarios involving large amounts of data and computationally expensive dissimilarity metrics


Summary

Introduction

In the pattern recognition (PR) field, the objective of supervised classification algorithms is to label unknown prototypes according to a finite set of categories by considering the knowledge automatically gathered from a reference corpus of labeled data. The k-nearest neighbor (kNN) rule is one such algorithm; despite its high performance, as reported in the literature, kNN is considered a low-efficiency algorithm, since all the elements in the reference corpus must be queried each time a new element needs to be classified (Yang et al., 2019). This issue is of particular importance in the context of structural data owing to the large amount of time consumed in the computation of dissimilarities. This work focuses on the data reduction (DR) family of methods, which tackles the aforementioned performance issue in kNN by proposing policies with which to reduce the size of the reference corpus and thus compute fewer distances. Since this reduction is typically carried out as a preprocessing stage, the additional computation it implies does not increase the actual temporal cost of the classification task. DR methods are commonly divided into prototype selection (PS), which retains a subset of the original elements, and prototype generation (PG), which creates new artificial prototypes to represent the data. PG approaches generally achieve sharper reduction rates than PS approaches, but their applicability is considerably limited owing to the difficulty of dealing with structural domains.
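
As an illustration of this efficiency issue and of what DR buys, the sketch below (an assumption on our part, not the paper's implementation) applies the kNN rule over a string reference set, reusing the levenshtein helper from the previous sketch: every query is compared against all stored prototypes, so the classification cost grows linearly with the size of the reference corpus, which is exactly the quantity that DR methods such as RHC shrink beforehand.

```python
# Minimal sketch: brute-force kNN over a string reference set.
# reference: iterable of (string, label) pairs; k: number of neighbours.

from collections import Counter

def knn_classify(query: str, reference, k: int = 1):
    """Return the majority label among the k prototypes closest to the query
    under the edit distance (one distance computation per stored prototype)."""
    neighbours = sorted(reference, key=lambda p: levenshtein(query, p[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```

Running knn_classify over the reduced set produced by a DR method applies exactly the same decision rule while requiring a fraction of the distance computations.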

