Abstract

High-dimension and low-sample-size (HDLSS) data sets pose great challenges to many machine learning methods, and new classification techniques are needed to handle practical HDLSS problems. Once the cause of the over-fitting phenomenon is identified, a new classification criterion for HDLSS data sets, termed tolerance similarity, is proposed; it emphasizes maximization of within-class variance on the premise of class separability. Based on this criterion, a novel linear binary classifier, termed the No-separated Data Maximum Dispersion classifier (NPDMD), is designed. The main idea of the NPDMD is to spread the samples of the two classes over a large interval in the respective positive or negative space along the projecting direction, provided that the distance between the projected means of the two classes is large enough. The salient features of the proposed NPDMD are: (1) the NPDMD operates well on HDLSS data sets; (2) the NPDMD solves the objective function in the entire feature space to avoid the data-piling phenomenon; (3) the NPDMD exploits the low-rank property of the covariance matrix of HDLSS data sets to accelerate computation; (4) the NPDMD is suitable for different real-world applications; (5) the NPDMD can be implemented readily using Quadratic Programming. Theoretical properties of the NPDMD are derived, and a series of evaluations is conducted on one simulated and six real-world benchmark data sets, including face classification and mRNA classification. The experimental results demonstrate the superiority of the NPDMD in terms of correct classification rate, mean within-group correct classification rate, and area under the ROC curve.
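
The two quantities that the criterion above trades off, the distance between the projected class means and the spread of each class along the projecting direction, can be illustrated with a minimal sketch. This is not the NPDMD objective or its Quadratic Programming formulation, which the abstract does not state; the function names, the toy data, and the direction `w` are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pvariance


def project(samples, w):
    """Project each sample (a sequence of floats) onto the direction w."""
    return [sum(x_i * w_i for x_i, w_i in zip(x, w)) for x in samples]


def projection_summary(pos, neg, w):
    """Return (gap between projected means, projected variance of the
    positive class, projected variance of the negative class) along w."""
    p = project(pos, w)
    n = project(neg, w)
    gap = abs(fmean(p) - fmean(n))
    return gap, pvariance(p), pvariance(n)


# Hypothetical toy data: 3 samples per class in 4 dimensions.
pos = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0], [5.0, 0.0, 2.0, 0.0]]
neg = [[-1.0, 0.0, 2.0, 0.0], [-3.0, 0.0, 2.0, 0.0], [-5.0, 0.0, 2.0, 0.0]]
w = [1.0, 0.0, 0.0, 0.0]  # assumed projecting direction

gap, var_pos, var_neg = projection_summary(pos, neg, w)
print(gap, var_pos, var_neg)
```

Along this `w`, each class stays on its own side of zero while still being spread out, which is the situation the abstract describes: dispersion within each class, subject to the projected means being far enough apart.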
