Abstract

The center method, which was first proposed in Rev. Stat. Appl. 1997 by Cazes et al. and Stat. Anal. Data Mining 2011 by Douzal-Chouakria et al., extends the well-known Principal Component Analysis (PCA) method to particular types of symbolic objects that are characterized by multivalued interval-type variables. In contrast to classical data, symbolic data have internal variation. The authors who originally proposed the center method used the center of each hyper-rectangle as the base point to carry out PCA, followed by the projection of all vertices of the hyper-rectangles as supplementary elements. Since these publications, the center point of the hyper-rectangle has typically been assumed to be the best point for the initial PCA. In this paper, however, we show that this is not always the case if the aim is to maximize the variance of the projections or to minimize the squared distance between the vertices and their respective projections. Instead, we propose the use of an optimization algorithm that maximizes the variance of the projections (or minimizes the squared distances between the vertices and their respective projections) and finds the optimal point for the initial PCA. The vertices of the hyper-rectangles are then projected as supplementary variables to this optimal point, which we call the “Best Point” for projection. For this purpose, we propose four new algorithms and two new theorems. The proposed methods and algorithms are illustrated using a data set comprising measurements of facial characteristics from a study on facial recognition patterns for use in surveillance. The performance of our approach is compared with that of another procedure in the literature, and the results show that our symbolic analyses provide more accurate information. Our approach can be regarded as an optimization method, as it maximizes the explained variance or minimizes the squared distance between projections and the original points. In addition, the symbolic analyses generate more informative conclusions than the classical analysis in which classical surrogates replace intervals. All the methods proposed in this paper can be executed in the RSDA package developed in R.

Highlights

  • The center method extends the well-known Principal Component Analysis (PCA) method to particular types of symbolic objects that are characterized by multivalued interval-type variables

  • We propose the use of an optimization algorithm that maximizes the variance of the projections and finds the optimal point for the initial Principal Component Analysis (PCA)

  • Symbolic methods based on interval-valued variables have important advantages, such as lower computational cost and shorter execution times, because smaller data tables are used

Summary

The Center Method

Symbolic data were introduced by Diday in [1]. In contrast to classical data analysis, in which a variable takes a single value, a variable in symbolic data can take a finite or infinite set of values. For example, an interval variable can take an infinite set of numerical values ranging from a lower bound to an upper bound. Because Principal Component Analysis (PCA) is one of the most popular multivariate methods for dimension reduction, its extension to symbolic data is important. The authors of [6] extended the classical PCA method to interval-valued data by using symbolic covariance to determine the principal component space, so that it reflects the total variation in the interval-valued data. PCA has also been extended to histogram data in a number of studies (see [7,8,9,10,11]). Most of these methods were developed for interval matrices, where an interval matrix $X$ has interval entries $[a_{ij}, b_{ij}]$, so that its first row begins $[a_{11}, b_{11}], [a_{12}, b_{12}], \ldots$ We let $D$ be the interval matrix with entries $[a_{ij} - X(c_j),\ b_{ij} - X(c_j)]$. The proof of this theorem can be found in [12,13].
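
The center method described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation or the RSDA code. It assumes that the base point for each variable is the column mean of the interval centers, that no standardization is applied, and that the principal axes come from the covariance matrix of the centers. The function name `center_method_pca` and its parameters are hypothetical. Instead of enumerating every vertex, the sketch uses the fact that, on a given axis, the extreme vertex projections are reached by pairing lower bounds with positive loadings and upper bounds with negative loadings (and the reverse for the maximum).

```python
import numpy as np

def center_method_pca(lower, upper, n_components=2):
    """Sketch of a centers-based PCA for an interval matrix [lower_ij, upper_ij].

    Returns the principal axes and, for each row, the minimum and maximum
    projection of its hyper-rectangle vertices on each axis.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # Center (midpoint) of each interval; one point per hyper-rectangle.
    centers = (lower + upper) / 2.0
    # Assumption: intervals are shifted by the column means of the centers.
    mean = centers.mean(axis=0)

    # Classical PCA on the centers: eigen-decomposition of their covariance.
    cov = np.cov(centers - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    axes = eigvecs[:, order]                      # shape: (n_variables, n_components)

    # Shifted interval bounds (the matrix D in the text, under the assumption above).
    lo_c = lower - mean
    up_c = upper - mean

    # Split loadings by sign to get the extreme vertex projections per axis.
    pos = np.clip(axes, 0.0, None)
    neg = np.clip(axes, None, 0.0)
    proj_min = lo_c @ pos + up_c @ neg
    proj_max = up_c @ pos + lo_c @ neg
    return axes, proj_min, proj_max


# Usage with a small, made-up interval table (4 objects, 3 interval variables).
lower = np.array([[1.0, 2.0, 0.5],
                  [2.0, 1.0, 1.5],
                  [0.0, 3.0, 2.0],
                  [4.0, 0.5, 1.0]])
upper = lower + np.array([[1.0, 0.5, 0.2],
                          [0.3, 1.0, 0.4],
                          [2.0, 0.1, 0.6],
                          [0.5, 0.5, 0.5]])
axes, proj_min, proj_max = center_method_pca(lower, upper, n_components=2)
```

Each pair `(proj_min[i, k], proj_max[i, k])` is the interval covered by object `i` on component `k`, so every object is displayed as a rectangle in the principal plane. Replacing `mean` with a different base point is where an alternative such as the Best Point would enter, although the optimization itself is not sketched here.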

The Best Point Method
Maximizing the Variance of the First Components
Findings
Conclusions