Abstract

Multiclass classification is commonly encountered in data science practice and has broad applications in areas such as biology, medicine, and engineering. Variable selection in multiclass problems is much more challenging than in binary classification or regression: in addition to estimating multiple discriminant functions for separating different classes, we need to decide which variables are important for each individual discriminant function as well as for the whole set of functions. In this paper, we address the multiclass variable selection problem by proposing a new form of penalty, supSCAD, which first groups, for each variable, the coefficients associated with all the discriminant functions, and then imposes the SCAD penalty on the sup-norm of each group. We apply the new penalty to both soft and hard classification and develop two new procedures: the supSCAD multinomial logistic regression and the supSCAD multi-category support vector machine. Our theoretical results show that, with a proper choice of the tuning parameter, the supSCAD multinomial logistic regression identifies the underlying sparse model consistently and enjoys oracle properties even when the dimension of the predictors grows to infinity. Based on local linear and quadratic approximations to the non-concave SCAD penalty and the nonlinear multinomial log-likelihood, we show that the new procedures can be implemented efficiently by solving a series of linear or quadratic programming problems. The performance of the new methods is illustrated by simulation studies and by real data analyses of the Small Round Blue Cell Tumors and Semeion Handwritten Digit data sets.
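The construction above can be made concrete with a short sketch: for a coefficient matrix whose rows index variables and whose columns index discriminant functions, the supSCAD penalty applies the SCAD function of Fan and Li (2001) to the sup-norm of each row. The function names and the matrix layout below are our own illustration, not the paper's implementation; `a = 3.7` is the conventional SCAD shape parameter.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), evaluated elementwise at t >= 0."""
    t = np.asarray(t, dtype=float)
    linear = lam * t                                        # t <= lam
    concave = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < t <= a*lam
    constant = lam**2 * (a + 1) / 2                         # t > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, concave, constant))

def sup_scad(B, lam, a=3.7):
    """supSCAD penalty: SCAD applied to the sup-norm of each row of B,
    where B[j, k] is the coefficient of variable j in discriminant function k."""
    sup_norms = np.max(np.abs(B), axis=1)  # group variable j across all functions
    return float(np.sum(scad(sup_norms, lam, a)))
```

Because each variable contributes a single sup-norm term, a variable whose coefficients are zero in every discriminant function adds nothing to the penalty; shrinking that sup-norm to zero removes the variable from all classes at once, which is the row-wise sparsity the method targets.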

Highlights

  • Multiclass classification is an important topic in statistical machine learning and has broad applications in practice, such as handwritten zip code digit recognition and cancer classification based on DNA microarray data (Hastie et al., 2009)

  • We propose and study a new class of regularization methods for variable selection in multiclass classification problems

  • This new penalty, supSCAD, enhances sparse learning in multiclass classification by retaining the merits of both the SCAD and sup-norm penalties

Introduction

Multiclass classification is an important topic in statistical machine learning and has broad applications in practice, such as handwritten zip code digit recognition and cancer classification based on DNA microarray data (Hastie et al., 2009). We propose and study a new class of learning methods for joint multiclass classification and variable selection. Hard classification rules, such as the support vector machine (SVM; Vapnik, 1995), directly target the argmax rule without estimating the conditional class probabilities. Modern penalized methods include the LASSO (Tibshirani, 1996), the adaptive LASSO (Zou, 2006), SCAD (Fan and Li, 2001), and MCP (Zhang, 2010); these methods can shrink small coefficients to exactly zero and achieve sparsity in the solution. Some of these methods have been applied to binary classification, such as L1 logistic regression (Hastie et al., 2009), the L1 SVM (Bradley and Mangasarian, 1998), and the SCAD SVM (Zhang et al., 2006). Major technical derivations and proofs are deferred to the Appendix.

The supSCAD Penalty
Quadratic Approximation for the Log-likelihood
Optimization Algorithms for supSCAD Logistic Regression
Implementation Issues
Extensions to Multiclass Support Vector Machines
Computational Algorithms
Simulation Studies
Method
Real Data Applications
Discussion