Abstract

We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random-projection ensemble classifier then aggregates the results of applying the base classifier to the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition that is implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random-projection ensemble classifier can be controlled by terms that do not depend on the original data dimension, together with a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.
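To make the recipe above concrete, here is a minimal Python sketch under simplifying assumptions: binary labels in {0, 1}, scaled Gaussian projection matrices (the paper itself draws projections from Haar measure), an LDA base classifier, and a single holdout split to estimate each projection's test error. The function name rp_ensemble and all parameter defaults are ours, for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rp_ensemble(X, y, X_test, B1=50, B2=20, d=5, alpha=0.5, seed=0):
    """Random-projection ensemble with an LDA base classifier (toy sketch).

    B1 groups of B2 random projections from R^p to R^d are drawn; within
    each group the projection with the smallest holdout error estimate is
    retained, and the B1 retained classifiers vote on each test point.
    Assumes binary labels in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # One holdout split, reused to estimate every projection's test error.
    idx = rng.permutation(n)
    tr, va = idx[: n // 2], idx[n // 2:]
    votes = np.zeros(len(X_test))
    for _ in range(B1):
        best_err, best_clf, best_A = np.inf, None, None
        for _ in range(B2):
            # Scaled Gaussian matrix: a simple stand-in for a Haar projection.
            A = rng.normal(size=(p, d)) / np.sqrt(d)
            clf = LinearDiscriminantAnalysis().fit(X[tr] @ A, y[tr])
            err = 1.0 - clf.score(X[va] @ A, y[va])
            if err < best_err:
                best_err, best_clf, best_A = err, clf, A
        votes += best_clf.predict(X_test @ best_A)
    # Assign class 1 when the vote fraction exceeds the threshold alpha.
    return (votes / B1 > alpha).astype(int)
```

Only the best projection in each group of B2 contributes a vote, so poor projections are filtered out before aggregation; increasing B1 then stabilizes the ensemble vote.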

Highlights

  • Supervised classification concerns the task of assigning an object to one of two or more groups, on the basis of a sample of labelled training data.

  • Another key feature of our proposal is the realization that a simple majority vote of the classifications based on the retained projections can be highly suboptimal; instead, we argue that the voting threshold should be chosen in a data-driven fashion, in an attempt to minimize the test error of the infinite-simulation version of our random-projection ensemble classifier (a toy sketch of such a threshold choice follows this list).

  • For comparison, we present the corresponding results of applying, where possible, the three base classifiers (LDA, quadratic discriminant analysis (QDA) and knn) in the original p-dimensional space, alongside 11 other classification methods chosen to represent the state of the art.
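The second highlight argues for a data-driven voting threshold rather than a fixed majority vote at 1/2. Below is a hedged sketch of one simple way to choose it: scan a grid of thresholds and keep the empirical-error minimizer. The authors' actual rule balances class-conditional error estimates rather than minimizing the raw misclassification rate, so this plain grid search is only a stand-in; choose_threshold and vote_frac are hypothetical names.

```python
import numpy as np

def choose_threshold(vote_frac, y, n_grid=101):
    """Grid-search the voting threshold instead of fixing a majority vote.

    vote_frac[i] is the fraction of the retained projections voting class 1
    for training point i (ideally computed out-of-sample); we return the
    threshold minimizing the empirical misclassification rate over a grid.
    """
    grid = np.linspace(0.0, 1.0, n_grid)
    errors = [np.mean((vote_frac > a).astype(int) != y) for a in grid]
    return grid[int(np.argmin(errors))]
```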



Introduction

Supervised classification concerns the task of assigning an object (or a number of objects) to one of two or more groups, on the basis of a sample of labelled training data. The problem was first studied in generality in the famous work of Fisher (1936), who introduced some of the ideas of linear discriminant analysis (LDA) and applied them to his iris data set. Classification problems arise in a plethora of applications, including spam filtering, fraud detection, medical diagnosis, market research, natural language processing and many others. Alternatives to LDA include support vector machines (SVMs) (Cortes and Vapnik, 1995), tree classifiers and random forests (RFs) (Breiman et al., 1984; Breiman, 2001), kernel methods (Hall and Kang, 2005) and nearest neighbour classifiers (Fix and Hodges, 1951). More substantial overviews and detailed discussions of these techniques, and others, can be found in Devroye et al. (1996) and Hastie et al. (2009).
