Abstract

In data mining and pattern classification, feature extraction and representation are a very important step, since the extracted features have a direct and significant impact on classification accuracy. In the literature, a number of novel feature extraction and representation methods have been proposed. However, many of them focus only on specific domain problems. In this article, we introduce a novel distance-based feature extraction method for various pattern classification problems. Specifically, two distances are extracted: (1) the distance between the data and its intra-cluster center and (2) the distance between the data and its extra-cluster centers. Experiments are conducted on ten datasets containing different numbers of classes, samples, and dimensions. The experimental results using naïve Bayes, k-NN, and SVM classifiers show that concatenating the distance-based features to the original features provided by the datasets can improve classification accuracy, except on image-related datasets. In particular, the distance-based features are suitable for datasets that have smaller numbers of classes, smaller numbers of samples, and lower feature dimensionality. Moreover, two datasets with similar characteristics are further used to validate this finding. The result is consistent with the first experiment: adding the distance-based features can improve classification performance.
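To make the two distances concrete, the following is a minimal sketch, not the authors' implementation. It assumes the cluster centers are already available (for example, from a clustering step), treats the nearest center as the sample's intra-cluster center, and sums the distances to the remaining centers into a single extra-cluster value. The function names `distance_features` and `augment`, the use of Euclidean distance, and the summation are all assumptions made for illustration.

```python
import math


def distance_features(x, centers):
    """Return (intra, extra) distance features for one sample.

    intra: Euclidean distance from x to its nearest center, which is
           treated here as the sample's own cluster center (assumption).
    extra: sum of Euclidean distances from x to all other centers
           (summing is an assumption; the abstract only says "distance").
    """
    dists = [math.dist(x, c) for c in centers]
    own = min(range(len(dists)), key=dists.__getitem__)  # index of nearest center
    intra = dists[own]
    extra = sum(d for i, d in enumerate(dists) if i != own)
    return intra, extra


def augment(x, centers):
    """Concatenate the original features with the two distance features."""
    return list(x) + list(distance_features(x, centers))


# Hypothetical cluster centers and sample, chosen only for illustration.
centers = [(0.0, 0.0), (4.0, 0.0), (4.0, 15.0)]
sample = (4.0, 3.0)
features = augment(sample, centers)
```

Here `features` holds the two original values followed by the intra-cluster and extra-cluster distances, which is the concatenated representation that would then be passed to a classifier such as k-NN or an SVM.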

Highlights

  • Data mining has received unprecedented focus in recent years

  • The novel distance-based features proposed in this article are examined over a number of different pattern classification problems; the distance-based features and the original features are concatenated to form a new feature representation for classification

  • Since feature extraction and representation have a direct and significant impact on the classification performance, we introduce novel distance-based features to improve classification accuracy over various domain datasets


Summary

Introduction

Data mining has received unprecedented focus in recent years. It can be used to analyze huge amounts of data and find valuable information. Pattern classification is an important research topic in the fields of data mining and machine learning. It focuses on constructing a model so that the input data can be assigned to the correct category. In this article, we introduce novel distance-based features to improve classification accuracy. The distance between a specific data point and its nearest centroid, together with the distances between that data point and the other centroids, should be able to provide valuable information for classification. The rest of the article is organized as follows. The PCA algorithm can be summarized in the following steps:

Literature review
Accuracy
Sample size
Support vector machines
Distances from extra-cluster center
Experiments
Conclusion
