Abstract

Five methods for discrimination are described, namely Euclidean Distance to centroids (EDC), Linear Discriminant Analysis (LDA) (based on the Mahalanobis distance and pooled variance covariance matrix), Quadratic Discriminant Analysis (QDA) (based on the Mahalanobis distance and individual class variance covariance matrix — non-Bayesian form), Learning Vector Quantization (LVQ) and Support Vector Machines (SVMs) (using soft boundaries and Radial Basis Functions), and illustrated graphically as boundary methods. The performance of each method was determined using four synthetic datasets each consisting of 200 samples half belonging to one of two classes, and a further two synthetic datasets containing 400 samples, again equally split between the two classes. In datasets 1 to 3, five variables were distributed multinormally, in dataset 1 the classes are distributed roughly circularly but with a significant degree of overlap, in dataset 2, the distribution is in elongated hyperellipsoids with small overlap, and in dataset 3 there is a region of complete overlap between classes. In dataset 4 two variables are distributed in a crescent shape. In datasets 5 and 6, 100 variables were generated from multinormal populations, some of which were potential discriminators, however a large proportion of the variables were designed to be uninformative. The methods were optimised using a training set and their performance evaluated using a test set: this was repeated 100 times for different test and training set splits. The average % correctly classified was computed for each class and model, as well as the model stability for each sample (the proportion of times the sample is classified into the same group over all 100 iterations). The conclusions are that the performance of the classifiers depends very much on the distribution of data. Approaches such as LVQ and SVMs that try to determine complex boundaries perform best when the data is not normally distributed such as in dataset 4, but can be prone to overfitting otherwise. QDA tends to perform best on multinormal data although it can be influenced by non-discriminative variables which show a difference in variance. It is recommended to look at the data structure prior to model building to determine the optimal type of model.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call