Abstract
The k-nearest neighbor (kNN) rule is one of the best-known distance-based classifiers and is usually associated with high performance and versatility, as it requires only the definition of a dissimilarity measure. Nevertheless, kNN also suffers from low efficiency since, for each new query, the algorithm must carry out an exhaustive search of the training data; this drawback is far more severe with complex structural representations, such as graphs, trees, or strings, owing to the cost of their dissimilarity metrics. This issue has generally been tackled through the use of data reduction (DR) techniques, which reduce the size of the reference set, although the complexity of structural data has historically limited their application in such scenarios. A DR algorithm known as reduction through homogeneous clusters (RHC) has recently been adapted to string representations, but since obtaining the exact median of a set of strings is computationally difficult, its authors resorted to computing the set-median instead. Under the premise that a more accurate median may be beneficial in this context, we present a new adaptation of the RHC algorithm for string data in which an approximate median computation is carried out. The results obtained show significant improvements over the set-median version of the algorithm in terms of both classification performance and reduction rates.
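For reference, the set-median of a collection of strings is simply the member of the set that minimizes the sum of distances to all other members, as opposed to the true (generalized) median, which may lie outside the set and is computationally hard to obtain under edit distance. The following is a minimal sketch, assuming Levenshtein edit distance as the string dissimilarity; the function names are illustrative and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def set_median(strings):
    """Return the member of `strings` minimizing the summed distance
    to all other members (the set-median, not the exact median)."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))
```

Note that even the set-median already requires a quadratic number of distance evaluations over the set, while the exact median is harder still, which is what motivates the approximate median computation explored in this work.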
Highlights
In the pattern recognition (PR) field, the objective of supervised classification algorithms is to label unknown prototypes according to a finite set of categories by considering the knowledge automatically gathered from a reference corpus of labeled data.
Since these corpora are composed of string elements, which constitute a type of structural representation, distance-based algorithms exhibit extremely poor efficiency when applied to them.
While the k-nearest neighbor (kNN) rule constitutes one of the best-known distance-based classifiers, it is generally associated with low efficiency when tackling scenarios involving large amounts of data and computationally expensive dissimilarity metrics; a minimal sketch of the classifier follows below.
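To make the classifier concrete, here is a minimal brute-force kNN over labeled strings, reusing the `levenshtein` helper from the earlier sketch; it performs exactly the exhaustive search over the reference corpus that the paper identifies as the efficiency bottleneck (the names are illustrative):

```python
from collections import Counter

def knn_classify(query, reference, k=3):
    """Brute-force kNN: rank every (string, label) pair in `reference`
    by distance to `query` and return the majority label among the
    k closest neighbors."""
    nearest = sorted(reference, key=lambda pair: levenshtein(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For instance, with hypothetical data, `knn_classify("acgt", [("acgg", 0), ("ttta", 1), ("acga", 0)], k=3)` returns 0, the majority label among the three neighbors.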
Summary
In the pattern recognition (PR) field, the objective of supervised classification algorithms is to label unknown prototypes according to a finite set of categories by considering the knowledge automatically gathered from a reference corpus of labeled data. Despite its high performance, as reported in the literature, kNN is considered a low-efficiency algorithm, since all the elements in the reference corpus must be queried each time a new element needs to be classified (Yang et al., 2019). This issue is of particular importance in the context of structural data owing to the large amount of time consumed in the computation of dissimilarities. This work focuses on the data reduction (DR) family of methods, which tackles the aforementioned performance issue in kNN by proposing policies with which to reduce the size of the reference corpus so that fewer distances need to be computed. This reduction is typically carried out as a preprocessing stage, and the additional computation it entails therefore does not increase the temporal cost of the classification task itself. Prototype generation (PG) approaches generally achieve sharper reduction rates than prototype selection (PS) approaches, but their applicability is considerably more limited owing to the difficulty of dealing with structural domains.
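As an illustration of the homogeneous-cluster idea behind RHC, the following simplified sketch recursively splits the reference corpus by nearest class median until every cluster is label-homogeneous, then keeps one set-median prototype per cluster; it reuses the `levenshtein` and `set_median` helpers above. This is only a sketch of the general strategy, not the paper's implementation, whose contribution lies precisely in replacing the set-median with an approximate median computation.

```python
def rhc_reduce(data):
    """Simplified reduction through homogeneous clusters.
    `data` is a list of (string, label) pairs."""
    labels = {lab for _, lab in data}
    if len(labels) == 1:
        # Homogeneous cluster: collapse it to a single median prototype.
        prototype = set_median([s for s, _ in data])
        return [(prototype, labels.pop())]
    # Mixed cluster: compute one set-median seed per class present...
    seeds = {lab: set_median([s for s, l in data if l == lab]) for lab in labels}
    # ...and partition the cluster by nearest seed.
    clusters = {lab: [] for lab in labels}
    for s, l in data:
        nearest = min(seeds, key=lambda lab: levenshtein(s, seeds[lab]))
        clusters[nearest].append((s, l))
    # Guard against a degenerate split that would recurse forever.
    if any(len(c) == len(data) for c in clusters.values()):
        return [(seeds[lab], lab) for lab in labels]
    reduced = []
    for cluster in clusters.values():
        if cluster:
            reduced.extend(rhc_reduce(cluster))
    return reduced
```

After reduction, the prototypes returned by `rhc_reduce` replace the full corpus as the reference set passed to `knn_classify`, so each query computes distances against far fewer elements.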