Abstract

Principal Component Analysis (PCA) is a commonly used technique that exploits the correlation structure of the original variables to reduce the dimensionality of the data. The reduction is achieved by retaining only the first few principal components for subsequent analysis. The usual inclusion criterion retains the leading components until their cumulative proportion of the total variance exceeds a predetermined threshold. We show that in certain classification problems even an extremely high inclusion threshold can negatively impact classification accuracy: omitting small-variance principal components can severely diminish model performance. We noticed this phenomenon in classification analyses of high-dimensional ECG data, where the most common classification methods lost between 1% and 6% of accuracy even with a 99% inclusion threshold. However, as our numerical example shows, the issue can occur even in low-dimensional data with a simple correlation structure. We conclude that the exclusion of any principal components should be carefully investigated.
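
The inclusion criterion described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper: the function name `n_components_for_threshold` and its arguments are hypothetical, and it assumes the rows of `X` are observations and the columns are variables.

```python
import numpy as np

def n_components_for_threshold(X, threshold=0.99):
    """Return (k, cum_share): the smallest number k of leading principal
    components whose cumulative share of total variance reaches
    `threshold`, and the full cumulative-share curve."""
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the sample covariance matrix are the variances of the
    # principal components; eigvalsh returns them in ascending order.
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    cum_share = np.cumsum(eigvals) / eigvals.sum()
    # Index of the first component at which the cumulative share >= threshold.
    k = int(np.searchsorted(cum_share, threshold)) + 1
    return min(k, len(eigvals)), cum_share
```

Every component beyond the first `k` is discarded under this rule, regardless of whether it carries information about the class label.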

Highlights

  • Principal Component Analysis (PCA) (Du et al, 2012; Hsieh et al, 2010; Mehmet Korürek, 2010; Kim et al, 2009) is a popular tool for data dimensionality reduction in the presence of a complex correlation structure among a large number of numerical variables

  • Our data consisted of 200 data points per heartbeat with a complex correlation structure, which seemed ideal for a preliminary PCA dimensionality reduction step before a subsequent classification approach was applied

  • In this work we show a potential performance problem for classification algorithms applied after a preliminary dimensionality reduction step via PCA

Summary

Introduction

Principal Component Analysis (PCA) (Du et al, 2012; Hsieh et al, 2010; Mehmet Korürek, 2010; Kim et al, 2009) is a popular tool for data dimensionality reduction in the presence of a complex correlation structure among a large number of numerical variables. In certain problems, however, dimensionality reduction via PCA is not a good idea, even with a high inclusion threshold. We noticed this phenomenon while implementing arrhythmia classification on ECG data, even though several studies have applied PCA to the same problem (Gupta and Mittal, 2019b, 2018b; Gupta et al, 2020; Gupta and Mittal, 2018a, 2016, 2019a). This example reveals that PCA may be ill-suited to certain types of classification problems.
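
The effect can be reproduced on synthetic data. The sketch below is our own illustration and not the paper's numerical example or its ECG analysis. It assumes two variables: `x1` has large variance and no relation to the class, while `x2` has small variance and separates the two classes. A two-class Fisher linear discriminant, written here in plain NumPy with equal priors assumed, stands in for the classifier. All names (`run_demo`, `lda_accuracy`) and parameter values are hypothetical.

```python
import numpy as np

def lda_accuracy(Z_train, y_train, Z_test, y_test):
    """Test accuracy of a two-class Fisher linear discriminant (equal priors)."""
    Z0, Z1 = Z_train[y_train == 0], Z_train[y_train == 1]
    m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
    # Sum of within-class covariances; atleast_2d handles the 1-D case.
    Sw = (np.atleast_2d(np.cov(Z0, rowvar=False))
          + np.atleast_2d(np.cov(Z1, rowvar=False)))
    w = np.linalg.solve(Sw, m1 - m0)      # discriminant direction
    cut = w @ (m0 + m1) / 2.0             # midpoint between projected class means
    pred = (Z_test @ w > cut).astype(int)
    return float((pred == y_test).mean())

def run_demo(seed=0, n=2000, threshold=0.99):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    x1 = rng.normal(0.0, 10.0, size=n)    # high variance, unrelated to class
    x2 = rng.normal(y - 0.5, 0.1)         # low variance, separates the classes
    X = np.column_stack([x1, x2])
    half = n // 2
    X_tr, y_tr, X_te, y_te = X[:half], y[:half], X[half:], y[half:]

    # Fit PCA on the training half only (eigen-decomposition of the covariance).
    mean = X_tr.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X_tr - mean, rowvar=False))
    order = np.argsort(eigvals)[::-1]     # sort components by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_share = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cum_share, threshold)) + 1, len(eigvals))

    S_tr = (X_tr - mean) @ eigvecs        # principal component scores
    S_te = (X_te - mean) @ eigvecs
    acc_all = lda_accuracy(S_tr, y_tr, S_te, y_te)
    acc_kept = lda_accuracy(S_tr[:, :k], y_tr, S_te[:, :k], y_te)
    return k, acc_all, acc_kept

if __name__ == "__main__":
    k, acc_all, acc_kept = run_demo()
    print(f"components kept at 99%: {k}")
    print(f"accuracy, all components:  {acc_all:.3f}")
    print(f"accuracy, kept components: {acc_kept:.3f}")
```

Under these assumptions the first component alone accounts for more than 99% of the total variance, so the 99% rule keeps a single component and discards the only direction that distinguishes the classes. The classifier should then perform near chance level on the retained component while remaining highly accurate on the full set of components.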
