Abstract

SUMMARY

This paper describes a form of cross-validation, in the context of principal component analysis, which has a number of useful aspects as regards multivariate data inspection and description. Topics covered include choice of dimensionality, identification of influential observations, and selection of important variables. The methods are motivated by and illustrated on a well-known data set.

1. Data Set and Objectives

Jeffers (1967) described two detailed multivariate case studies, one of which concerned 19 variables measured on each of 40 winged aphids (alate adelges) that had been caught in a light trap. The 19 variables are listed in Table 1. Principal component analysis (PCA) was used to examine the structure in the data, and if possible to answer the following questions:

(i) How many dimensions of the individuals are being measured?
(ii) How many distinct taxa are present in the habitat?
(iii) Which variables among the 19 are redundant for distinguishing between taxa, and which must be retained in future work?

Of the 19 variables, 14 are length or width measurements, four are counts, and one (anal fold) is a presence/absence variable scored 0 or 1. In view of this disparity in variable type, Jeffers elected to standardise the data and thus effect the PCA by finding the latent roots and vectors of the correlation (rather than covariance) matrix of the data. The elements of each latent vector provide the coefficients of one of 19 linear combinations of the standardised original variables that successively maximise sample variance subject to being orthogonal to each other, and the corresponding latent root is the sample variance of that linear combination. The 19 observations for each aphid were subjected to each of these 19 linear transformations to form the 19 principal component scores for that aphid.

The above questions were then answered as follows:

(i) The latent roots of the correlation matrix were as given in Table 1. The four largest comprise 73.0%, 12.5%, 3.9%, and 2.6%, respectively, of the total variance (19.0) of the standardised variables; the dimensionality of the data was therefore taken to be 2.
(ii) When the scores of the first two principal components for the 40 aphids were plotted against orthogonal axes, the resulting 40 points divided into four groups as shown in Figure 1. Hence, four distinct species were identified for the aphids.
(iii) From consideration of the size of coefficients in the first three principal components, it was concluded that only the four variables length of tibia, number of ovipositor
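
The correlation-matrix PCA described in this section can be sketched in Python with NumPy. This is only a minimal illustration under stated assumptions, not the computation performed by Jeffers or in this paper: the function name `correlation_pca` is hypothetical, and the 40 x 19 data matrix is synthetic (two made-up latent factors plus noise), because the aphid measurements are not reproduced here. The sketch standardises the variables, extracts the latent roots and vectors of the correlation matrix, forms the principal component scores, and expresses each root as a percentage of the total variance.

```python
import numpy as np


def correlation_pca(X):
    """PCA via the latent roots and vectors of the correlation matrix.

    Hypothetical helper written for illustration only.
    X is an (n observations) x (p variables) array.
    """
    X = np.asarray(X, dtype=float)

    # Standardise each variable to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Correlation matrix of the original variables (p x p).
    R = np.corrcoef(X, rowvar=False)

    # eigh handles symmetric matrices and returns roots in ascending order,
    # so reorder to put the largest latent root first.
    roots, vectors = np.linalg.eigh(R)
    order = np.argsort(roots)[::-1]
    roots = roots[order]
    vectors = vectors[:, order]

    # Each column of `scores` is one principal component, i.e. one linear
    # combination of the standardised variables.
    scores = Z @ vectors

    # Share of the total variance (which equals p) carried by each root.
    percent = 100.0 * roots / roots.sum()
    return roots, vectors, scores, percent


# Synthetic stand-in for a 40 x 19 data matrix (an assumption, not real data).
rng = np.random.default_rng(0)
n, p = 40, 19
factors = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, p))
X = factors @ loadings + 0.3 * rng.normal(size=(n, p))

roots, vectors, scores, percent = correlation_pca(X)
print("total variance:", roots.sum())
print("percent of variance, four largest roots:", np.round(percent[:4], 1))
```

With standardised variables the latent roots always sum to the number of variables, which is why the total variance quoted in the text is 19.0. The sample variance of each column of `scores` equals the corresponding latent root, and plotting the first two columns against each other gives the kind of two-dimensional display used to look for groups of individuals.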
