Abstract
This paper proposes a novel approach for selecting a subset of features in semi-supervised datasets, where only some of the patterns are labeled. The process has two phases. In Phase-I, the dataset is divided into two parts: the first part contains the labeled patterns and the second part contains the unlabeled patterns. A small number of features is identified using well-known maximum-relevance (computed on the first part) and minimum-redundancy (computed on the whole dataset) feature selection approaches based on the correlation coefficient. From this identified set, the subset of features that yields high classification accuracy on the labeled patterns with any supervised classifier is selected for later processing. In Phase-II, the patterns of the first and second parts are clustered separately, with the number of clusters equal to the number of classes in the dataset. Each cluster of the first part is assigned the class to which the majority of its patterns belong, since these labels are already given. The cluster centroids of the two parts are then paired: the centroid of the second part nearest to a centroid of the first part is paired with it. Because the class of the first-part centroid is known, the same class is assigned to the paired second-part cluster, whose class is unknown. If the actual classes of the patterns in the second part are known, they can be used to test the classification accuracy on the second part. The proposed two-phase approach performs well in terms of classification accuracy and number of features selected on the given benchmark datasets.
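The Phase-II labeling step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. The abstract does not name a clustering algorithm, so plain k-means seeded with the first k points is an assumption, as are the Euclidean distance, the greedy one-to-one pairing of centroids, the toy data, and all function names (`kmeans`, `label_unlabeled`, and so on).

```python
from collections import Counter
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest(p, centers):
    """Index of the center closest to p (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: dist(p, centers[c]))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; seeding with the first k points is an assumption."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute assignments so they always match the returned centers.
    return centers, [nearest(p, centers) for p in points]


def label_unlabeled(labeled, labels, unlabeled, k):
    """Cluster both parts, then transfer classes through paired centroids."""
    l_centers, l_assign = kmeans(labeled, k)
    u_centers, u_assign = kmeans(unlabeled, k)

    # Majority class of each labeled cluster.
    cluster_class = {}
    for c in range(k):
        votes = Counter(y for y, a in zip(labels, l_assign) if a == c)
        cluster_class[c] = votes.most_common(1)[0][0] if votes else None

    # Pair each labeled centroid with its nearest still-unpaired unlabeled
    # centroid (greedy one-to-one pairing, an assumption of this sketch).
    free = set(range(k))
    u_class = {}
    for c, lc in enumerate(l_centers):
        j = min(sorted(free), key=lambda u: dist(lc, u_centers[u]))
        free.remove(j)
        u_class[j] = cluster_class[c]

    return [u_class[a] for a in u_assign]


# Toy data (hypothetical): two well-separated classes in two dimensions.
labeled = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.2)]
labels = ["benign", "malignant", "benign", "malignant"]
unlabeled = [(0.9, 1.1), (8.1, 7.9), (1.1, 1.2), (8.3, 8.0)]

predicted = label_unlabeled(labeled, labels, unlabeled, k=2)
print(predicted)
```

If the true classes of the second part are available, comparing them with `predicted` gives the classification accuracy mentioned in the abstract.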
Highlights
Pattern classification [1] is one of the core challenging tasks [2,3] in data mining [4,5], web mining [6], bioinformatics [7], and financial forecasting [8,9]
The correlation coefficients, calculated as described in the proposed scheme in Section 3, are used to determine the number of features with maximum relevance for each dataset
The fourth column shows the features with low redundancy based on minimum correlation coefficient values calculated by taking features together in pairs
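The relevance and redundancy criteria mentioned in these highlights can be illustrated with a short sketch. It assumes numeric class labels, the Pearson correlation coefficient, and a greedy filter with hypothetical parameters `n_relevant` and `max_redundancy`; none of these details are confirmed by the source. The paper's further step of choosing the final subset by supervised classification accuracy is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def column(rows, j):
    """Values of feature j across all rows."""
    return [row[j] for row in rows]


def select_features(labeled_X, labels, all_X, n_relevant, max_redundancy):
    """Greedy sketch of maximum relevance and minimum redundancy.

    Relevance: |corr(feature, label)| on the labeled part only.
    Redundancy: |corr(feature, feature)| over the whole dataset.
    """
    n_features = len(labeled_X[0])
    relevance = [abs(pearson(column(labeled_X, j), labels))
                 for j in range(n_features)]
    # Keep the n_relevant most relevant features, best first.
    ranked = sorted(range(n_features), key=lambda j: relevance[j],
                    reverse=True)[:n_relevant]

    kept = []
    for j in ranked:
        col_j = column(all_X, j)
        # Drop a candidate that is too correlated with a feature already kept.
        if all(abs(pearson(col_j, column(all_X, k))) <= max_redundancy
               for k in kept):
            kept.append(j)
    return kept


# Toy data (hypothetical): feature 1 nearly duplicates feature 0.
labeled_X = [[1, 2, 5], [2, 4, 3], [3, 6, 6], [4, 9, 4]]
labels = [0, 0, 1, 1]
unlabeled_X = [[5, 10, 5], [6, 12, 2]]
all_X = labeled_X + unlabeled_X

selected = select_features(labeled_X, labels, all_X,
                           n_relevant=3, max_redundancy=0.9)
print(selected)
```

In this toy example, feature 1 is dropped because it is almost perfectly correlated with the more relevant feature 0 over the whole dataset, while the weakly correlated feature 2 is kept.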
Summary
Pattern classification [1] is one of the core challenging tasks [2,3] in data mining [4,5], web mining [6], bioinformatics [7], and financial forecasting [8,9]. The goal of classification [10,11] is to assign a new entity to a class from a pre-specified set of classes. The importance of pattern classification is illustrated by the classification of breast cancer. Patients fall into two classes: the “benign” class, with no breast cancer, and the “malignant” class, which shows strong evidence of breast cancer.