Dimensionality reduction (DR) techniques are becoming increasingly important in engineering and scientific applications for feature extraction and data visualization purposes. The task becomes significantly more demanding for class-imbalanced data sets, in which the ratio of observations across classes is disproportionate. The main objective of this paper is to thoroughly evaluate the performance of various dimensionality reduction techniques for analyzing a cognate set of computational and experimental data in the field of structural and earthquake engineering with varying numbers of classes. Notably, a data set is generated using a computer simulation that characterizes the risks posed by earthquakes to structures. Depending on the severity of damage in the simulations, the data are classified into three, five, and eight classes. This study also uses three existing multi-class imbalanced data sets that were developed independently. For each problem, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-Distributed Stochastic Neighbor Embedding (t-SNE) are applied to cover a wide range of linear/non-linear and supervised/unsupervised DR techniques. To address the class-imbalance problem, the efficacy of the Synthetic Minority Oversampling Technique (SMOTE) is investigated in combination with various classifiers, including K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Trees (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), and AdaBoost. It is observed that combining SMOTE and LDA shows promising results for training classifiers on the reduced data. In light of the findings of this study, it is recommended to train DT classifiers after reducing the dimensionality of the input data. Through this process, the scientific data are projected onto a two-dimensional space in which decision surfaces can be visualized, helping practitioners understand the performance of the classifiers.
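As an illustrative sketch only (not the authors' implementation or data), the following Python snippet, assuming scikit-learn, imbalanced-learn, and matplotlib with a synthetic imbalanced data set, mirrors the recommended workflow: SMOTE oversampling, LDA projection to two dimensions, training a Decision Tree on the reduced data, and visualizing its decision surface.

```python
# Illustrative sketch of the SMOTE + LDA + Decision Tree workflow described above.
# The synthetic data set and library choices are assumptions for demonstration,
# not the paper's actual data or code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for a multi-class imbalanced data set (3 classes, skewed).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=3, weights=[0.80, 0.15, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Oversample the minority classes on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Supervised linear DR: project onto two dimensions with LDA.
lda = LinearDiscriminantAnalysis(n_components=2)
Z_res = lda.fit_transform(X_res, y_res)
Z_test = lda.transform(X_test)

# Train a Decision Tree classifier on the reduced data and evaluate it.
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Z_res, y_res)
print(classification_report(y_test, clf.predict(Z_test)))

# Visualize the decision surface in the 2-D LDA space.
xx, yy = np.meshgrid(
    np.linspace(Z_res[:, 0].min() - 1, Z_res[:, 0].max() + 1, 300),
    np.linspace(Z_res[:, 1].min() - 1, Z_res[:, 1].max() + 1, 300),
)
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(Z_test[:, 0], Z_test[:, 1], c=y_test, s=10, edgecolor="k")
plt.xlabel("LDA component 1")
plt.ylabel("LDA component 2")
plt.title("Decision surface of a DT classifier in LDA-reduced space")
plt.show()
```

Any of the other classifiers mentioned above (KNN, SVM, RF, MLP, AdaBoost) could be substituted for the Decision Tree in the same pipeline; the two-dimensional projection is what makes the decision-surface plot possible.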