Principal Component Analysis

  • Abstract
  • Literature Map
  • Similar Papers
Abstract
Translate article icon Translate Article Star icon

Principal component analysis (PCA) is often applied for analyzing data in the most diverse areas. This work reports, in an accessible and integrated manner, several theoretical and practical aspects of PCA. The basic principles underlying PCA, data standardization, possible visualizations of the PCA results, and outlier detection are subsequently addressed. Next, the potential of using PCA for dimensionality reduction is illustrated on several real-world datasets. Finally, we summarize PCA-related approaches and other dimensionality reduction techniques. All in all, the objective of this work is to assist researchers from the most diverse areas in using and interpreting PCA.

Similar Papers
  • PDF Download Icon
  • Research Article
  • Cite Count Icon 132
  • 10.3390/rs12071206
Hyperspectral Estimation of Soil Organic Matter Content using Different Spectral Preprocessing Techniques and PLSR Method
  • Apr 8, 2020
  • Remote Sensing
  • Lanzhi Shen + 6 more

Soil organic matter (SOM) is the main source of soil nutrients, which are essential for the growth and development of agricultural crops. Hyperspectral remote sensing is one of the most efficient ways of estimating the SOM content. Visible, near infrared, and mid-infrared reflectance spectroscopy, combined with the partial least squares regression (PLSR) method is considered to be an effective way of determining soil properties. In this study, we used 54 different spectral pretreatments to preprocess soil spectral data. These spectral pretreatments were composed of three denoising methods, six data transformations, and three dimensionality reduction methods. The three denoising methods included no denoising (ND), Savitzky–Golay denoising (SGD), and wavelet packet denoising (WPD). The six data transformations included original spectral data, R; reciprocal, 1/R; logarithmic, log(R); reciprocal logarithmic, log(1/R); first derivative, R’; and first derivative of reciprocal, (1/R)’. The three dimensionality reduction methods included no dimensionality reduction (NDR), sensitive waveband dimensionality reduction (SWDR), and principal component analysis (PCA) dimensionality reduction (PCADR). The processed spectra were then employed to construct PLSR models for predicting the SOM content. The main results were as follows—(1) the wavelet packet denoising (WPD)-R’ and WPD-(1/R)’ data showed stronger correlations with the SOM content. Furthermore, these methods could effectively limit the correlation between the adjacent bands and, thus, prevent “overfitting”. (2) Of the 54 pretreatments investigated, WPD-(1/R)’-PCADR yielded the model with the highest accuracy and stability. (3) For the same denoising method and spectral transformation data, the accuracy of the SOM content estimation model based on SWDR was higher than that of the model based on NDR. Furthermore, the accuracy in the case of PCADR was higher than that for SWDR. (4) Dimensionality reduction was effective in preventing data overfitting. (5) The quality of the spectral data could be improved and the accuracy of the SOM content estimation model could be enhanced effectively, by using some appropriate preprocessing methods (one combining WPD and PCADR in this study).

  • Research Article
  • Cite Count Icon 12
  • 10.1080/07038992.2016.1175928
Soil Moisture Retrieval over a Semiarid Area by Means of PCA Dimensionality Reduction
  • Mar 3, 2016
  • Canadian Journal of Remote Sensing
  • Xiang Zhang + 4 more

The main objective of this study is to develop a multifeature soil moisture retrieval method based on the principal component analysis (PCA) dimensionality reduction technique. RADARSAT-2 data were used to compute the backscattering coefficients and polarimetric variables. The optimal input features for soil moisture retrieval were selected by means of PCA dimensionality reduction and least root mean square error (RMSE) criterion. The support vector regression (SVR) model was used to estimate soil moisture content. The results indicated that the optimal features extracted by the PCA dimensionality reduction showed high correlation with soil moisture content. The RMSE, R2 (determination coefficient) and mean relative error (MRE) were (1.4 vol.%, 0.73, 18.2%) and (1.6 vol.%, 0.66, 15.6%) over the low grass cover areas A and B, respectively. For the bare soil areas A and B, the statistic results were (1.3 vol.%, 0.76, 12.1%) and (1.6 vol.%, 0.72, 14.9%), respectively. This case study confirmed the potential of the developed approach to estimate soil moisture over the low grass cover and bare soil areas.

  • Research Article
  • Cite Count Icon 8
  • 10.3390/w15040701
Fast Identification Method of Mine Water Source Based on Laser-Induced Fluorescence Technology and Optimized LSTM
  • Feb 10, 2023
  • Water
  • Pengcheng Yan + 5 more

There is a great threat to the production safety of coal mines caused by mine water disasters. Traditional identification methods are not adapted to the efficiency of today’s coal mining and do not offer the advantage of accurate detection in real-time. In this study, the Mayfly Algorithm (MA) was used to optimize the Long Short-Term Memory (LSTM) network, combined with laser-induced fluorescence technology, to apply it to the identification of mine water sources for the prevention of mine water disasters and post-disaster relief work. Taking sandstone water and goaf water as the original samples, five mixed water samples were also prepared by mixing the sandstone water and goaf water in different proportions, giving a total of seven water samples to be tested. Laser-induced fluorescence technology was used to obtain the fluorescence spectral data of water samples, and then the Linear Discriminant Analysis (LDA) dimensionality reduction algorithm and the Principal Component Analysis (PCA) dimensionality reduction algorithm were used to reduce the dimensions of the original spectral data. Then, three architectures, including LSTM, GA-LSTM (optimization of the LSTM by genetic algorithm) and MA-LSTM were designed to identify mine water sources. Finally, from the results’ analysis, MA-LSTM performs best in many aspects after PCA dimensionality reduction and has the best identification effect. These results supported the feasibility of the novel method.

  • Research Article
  • Cite Count Icon 2
  • 10.1155/2024/6648925
Research on Demagnetization Fault Diagnosis Method of Mine Cutting Permanent Magnet Synchronous Motor
  • Jan 12, 2024
  • International Journal of Rotating Machinery
  • Guo Ye + 3 more

To give timely and accurate diagnosis in the early stage of demagnetization failure for effective control and treatment, based on wavelet packet analysis, principal component analysis (PCA) dimensionality reduction, and least squares support vector machine(LSSVM), the extraction of features and the classification of demagnetization faults are completed. Since it is difficult to collect real data sets of demagnetization faults in practice, a two-dimensional finite element simulation model of permanent magnet synchronous motor (PMSM) under uniform demagnetization and partial demagnetization faults is established based on the Maxwell simulation platform. The wavelet packet analysis is used to extract the demagnetization feature of the A-phase current of the PMSM. Based on PCA dimensionality reduction, the dimensionality reduction of fault features is realized. The LSSVM is used to identify the fault and complete the fault classification. The simulation results show that the method has a high classification accuracy rate for demagnetization faults.

  • Research Article
  • Cite Count Icon 2
  • 10.1155/2021/4300059
Retracted] Matching Subsequence Music Retrieval in a Software Integration Environment
  • Jan 1, 2021
  • Complexity
  • Zhencong Li + 2 more

This paper firstly introduces the basic knowledge of music, proposes the detailed design of a music retrieval system based on the knowledge of music, and analyzes the feature extraction algorithm and matching algorithm by using the features of music. Feature extraction of audio data is the important research of this paper. In this paper, the main melody features, MFCC features, GFCC features, and rhythm features, are extracted from audio data and a feature fusion algorithm is proposed to achieve the fusion of GFCC features and rhythm features to form new features under the processing of principal component analysis (PCA) dimensionality reduction. After learning the main melody features, MFCC features, GFCC features, and rhythm features, based on the property that PCA dimensionality reduction can effectively reduce noise and improve retrieval efficiency, this paper proposes vector fusion by dimensionality reduction of GFCC features and rhythm features. The matching retrieval of audio features is an important task in music retrieval. In this paper, the DTW algorithm is chosen as the main algorithm for retrieving music. The classification retrieval of music is also achieved by the K‐nearest neighbor algorithm. In this paper, after implementing the research and improvement of algorithms, these algorithms are integrated into the system to achieve audio preprocessing, feature extraction, feature postprocessing, and matching retrieval. This article uses 100 different kinds of MP3 format music as the music library and randomly selects 4 pieces each time, and it tests the system under different system parameters, recording duration, and environmental noise. Through the research of this paper, the efficiency of music retrieval is improved and theoretical support is provided for the design of music retrieval software integration system.

  • Research Article
  • Cite Count Icon 72
  • 10.1016/j.bspc.2016.12.017
Evaluation of effect of unsupervised dimensionality reduction techniques on automated arrhythmia classification
  • Jan 10, 2017
  • Biomedical Signal Processing and Control
  • Rekha Rajagopal + 1 more

Evaluation of effect of unsupervised dimensionality reduction techniques on automated arrhythmia classification

  • Research Article
  • Cite Count Icon 1
  • 10.17762/turcomat.v12i2.1433
Comparative Analysis of Machine Learning Techniques with Principal Component Analysis on Kidney and Heart Disease
  • Apr 10, 2021
  • Turkish Journal of Computer and Mathematics Education (TURCOMAT)
  • Reena Chandra, Et Al

Detection of disease at earlier stages is the most challenging one. Datasets of different diseases are available online with different number of features corresponding to a particular disease. Many dimensionality reduction and feature extraction techniques are used nowadays to reduce the number of features in dataset and finding the most appropriate ones. This paper explores the difference in performance of different machine learning models using Principal Component Analysis dimensionality reduction technique on the datasets of Chronic kidney disease and Cardiovascular disease. Further, the authors apply Logistic Regression, K Nearest Neighbour, Naïve Bayes, Support Vector Machine and Random Forest Model on the datasets and compare the performance of the model with and without PCA. A key challenge in the field of data mining and machine learning is building accurate and computationally efficient classifiers for medical applications. With an accuracy of 100% in chronic kidney disease and 85% for heart disease, KNN classifier and logistic regression were revealed to be the most optimal method of predictions for kidney and heart disease respectively.

  • Conference Article
  • Cite Count Icon 9
  • 10.1109/icesc51422.2021.9533011
Comparative Analysis of Machine Learning Techniques with Principal Component Analysis on Kidney and Heart Disease
  • Aug 4, 2021
  • Reena Chandra + 2 more

Detection of disease at earlier stages is the most challenging one. Datasets of different diseases are available online with different number of features corresponding to a particular disease. Many dimensionality reduction and feature extraction techniques are used nowadays to reduce the number of features in dataset and finding the most appropriate ones. This paper explores the difference in performance of different machine learning models using Principal Component Analysis dimensionality reduction technique on the datasets of Chronic kidney disease and Cardiovascular disease. Further, the authors apply Logistic Regression, K Nearest Neighbour, Naïve Bayes, Support Vector Machine and Random Forest Model on the datasets and compare the performance of the model with and without PCA. A key challenge in the field of data mining and machine learning is building accurate and computationally efficient classifiers for medical applications. With an accuracy of 100% in chronic kidney disease and 85% for heart disease, KNN classifier and logistic regression were revealed to be the most optimal method of predictions for kidney and heart disease respectively.

  • Research Article
  • 10.17671/gazibtd.1484037
Enhancing Skin Cancer Diagnosis through the Integration of Deep Learning and Machine Learning Approaches
  • Oct 31, 2024
  • Bilişim Teknolojileri Dergisi
  • Yahya Doğan + 1 more

Skin cancer is a disease characterized by the uncontrolled proliferation of skin cells, typically manifesting as lesions or abnormal growths. Early diagnosis is critical for improving treatment outcomes. This study proposes an innovative approach to skin cancer diagnosis by integrating modern deep learning models with traditional machine learning algorithms. A three-phase methodology was developed. In the first phase, meaningful features were extracted from skin lesion images using various transfer learning models, including Xception, VGG16, ResNet152V2, InceptionV3, InceptionResNetV2, MobileNetV2, EfficientNetB2, and DenseNet201. In the second phase, dimensionality reduction was performed using Principal Component Analysis (PCA). In the final phase, the reduced feature sets were classified using K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms. Experimental results demonstrated that the highest accuracy of 91.28% was achieved through the combination of DenseNet201 for feature extraction, PCA for dimensionality reduction, and Random Forest for classification. These findings highlight the effectiveness of integrating transfer learning models, dimensionality reduction techniques, and machine learning algorithms in enhancing the accuracy of skin cancer diagnosis.

  • Conference Article
  • 10.1109/iccece51280.2021.9342094
A case study by using Python to implement data and dimensionality reduction
  • Jan 15, 2021
  • Huang Chih-Chien + 4 more

The purpose of this study is to research and explore the Data Dimension, and propose the data feature & selection of dimensionality reduction technique, in order to help users understand the impact and meaning between dimensionality reduction parameters and data dimension, thereby strengthening the use of dimension reduction algorithm. In previous studies, many scholars have proposed dimensionality reduction algorithms for various data types, such as Multi-Dimensional Scaling (MDS), Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Facet Analysis (FA), Isometric Feature Maps (Isomap, using for manifold analysis), Local Linear Embedding (LLE), and Laplacian feature maps (Laplacian Eigenmaps). Most of these algorithms do not need to set parameters, and it has been obtained during the experiment that the selection of parameters has no visual analysis effect on the dataset in this experiment, and should be determined according to the feature of the dataset. This study is conducted by comparing the most used PCA and LDA dimensionality reduction techniques, as well as the analysis of merging other similarity methods while using MDS to process mixed data.

  • Research Article
  • Cite Count Icon 17
  • 10.5589/m08-007
Evaluation and comparison of dimensionality reduction methods and band selection
  • Jan 1, 2008
  • Canadian Journal of Remote Sensing
  • Guangyi Chen + 1 more

For dimensionality reduction (DR) of a hyperspectral data cube or band selection, it is desirable to have one method that is suitable for all remote sensing applications. However, in reality this is not possible. A specific remote sensing application requires a specific DR or band selection method that best suits it. In this paper, the evaluation and comparison of three DR methods‐namely, principal component analysis (PCA), wavelet, and minimum noise fraction (MNF)‐and one band selection method were conducted. Based on the experiments, the following was observed. For endmember extraction, the PCA DR, wavelet DR, and band selection found all five endmembers. However, the MNF DR missed one endmember. For mineral detection, the MNF DR produced a map that is closest to the true map when compared with the other DR methods and band selection method. For classification, the PCA DR produced the highest classification rates whereas the other methods yielded less classification rates.

  • Conference Article
  • Cite Count Icon 7
  • 10.1109/iscis.2007.4456865
Clustering and dimensionality reduction to determine important software quality metrics
  • Nov 1, 2007
  • Metin Turan + 1 more

During the last two decades research on software engineering is concentrated on quality. The best approach to quality evaluation goes through determining well-defined metrics on software properties. One such property is module complexity, which is a view of the software that is related to how easily it can be modified. There has been work on constructing a metrics domain which measures the module complexity. Generally, PCA (Principal Component Analysis) is used for defining principal metrics in the domain. Since there are usually no labels for the software data, an unsupervised dimensionality reduction technique, such as PCA needs to be used for determining the most important metrics. In this study, we use the clustering similarity obtained when a certain subset of metrics and when the whole set of metrics are used, to determine the most important metrics. We measure the relative difference/similarity between clusterings using three different indices, namely Rand, Jaccard and Fowlkes-Mallow. We use both backward feature selection and PCA for dimensionality reduction. On the publicly available NASA data, we find out that instead of the whole set of 42 metrics, using only 15 dimensions, we get almost the same clustering performance. Therefore, instead of the whole set of software metrics, a smaller number of them could be used to evaluate the software quality.

  • Research Article
  • 10.2478/pjmpe-2023-0015
Complexity analysis of VMAT prostate plans: insights from dimensionality reduction and information theory techniques
  • Jul 29, 2023
  • Polish Journal of Medical Physics and Engineering
  • Efstathios Kamperis + 7 more

Introduction: Volumetric Modulated Arc Therapy (VMAT) is a state-of-the-art prostate cancer treatment, defined by high dose gradients around targets. Its unique dose shaping incurs hidden complexity, impacting treatment deliverability, carcinogenesis, and machine strain. This study compares various aperture-based VMAT complexity indices in prostate cases using principal component and mutual information analyses. It suggests essential properties for an ideal complexity index from an information-theoretic viewpoint. Material and methods: The following ten complexity indices were calculated in 217 VMAT prostate plans: circumference over area (CoA), edge metric (EM), equivalent square field (ESF), leaf travel (LT), leaf travel modulation complexity score for VMAT (LTMCSV), mean-field area (MFA), modulation complexity score (standard MCS and VMAT variant MCSV), plan irregularity (PI), and small aperture score (SAS5mm). Principal component analysis (PCA) was applied to explore the correlations between the metrics. The differential entropy of all metrics was also calculated, along with the mutual information for all 45 metric pairs. Results: Whole-pelvis plans had greater complexity across all indices. The first three principal components explained 96.2% of the total variance. The complexity metrics formed three groups with similar conceptual characteristics, particularly ESF, LT, MFA, PI, and EM, SAS5mm. The differential entropy varied across the complexity metrics (PI having the smallest vs. EM the largest). Mutual information analysis (MIA) confirmed some metrics’ interdependence, although other pairs, such as LTMCSV/SAS5mm, LT/MCSV, and EM/SAS5mm, were found to share minimal MI. Conclusions: There are many complexity indices for VMAT described in the literature. PCA and MIA analyses can uncover significant overlap among them. However, this is not entirely reducible through dimensionality reduction techniques, suggesting that there also exists some reciprocity. When designing predictive models of quality assurance metrics, PCA and MIA may prove useful for feature engineering.

  • Peer Review Report
  • 10.7554/elife.80063.sa2
Author response: Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures
  • Nov 28, 2022
  • Vasileios Karageorgiou + 3 more

Full text Figures and data Side by side Abstract Editor's evaluation Introduction Results Discussion Materials and methods Appendix 1 Data availability References Decision letter Author response Article and author information Metrics Abstract Multivariable Mendelian randomisation (MVMR) is an instrumental variable technique that generalises the MR framework for multiple exposures. Framed as a regression problem, it is subject to the pitfall of multicollinearity. The bias and efficiency of MVMR estimates thus depends heavily on the correlation of exposures. Dimensionality reduction techniques such as principal component analysis (PCA) provide transformations of all the included variables that are effectively uncorrelated. We propose the use of sparse PCA (sPCA) algorithms that create principal components of subsets of the exposures with the aim of providing more interpretable and reliable MR estimates. The approach consists of three steps. We first apply a sparse dimension reduction method and transform the variant-exposure summary statistics to principal components. We then choose a subset of the principal components based on data-driven cutoffs, and estimate their strength as instruments with an adjusted F-statistic. Finally, we perform MR with these transformed exposures. This pipeline is demonstrated in a simulation study of highly correlated exposures and an applied example using summary data from a genome-wide association study of 97 highly correlated lipid metabolites. As a positive control, we tested the causal associations of the transformed exposures on coronary heart disease (CHD). Compared to the conventional inverse-variance weighted MVMR method and a weak instrument robust MVMR method (MR GRAPPLE), sparse component analysis achieved a superior balance of sparsity and biologically insightful grouping of the lipid traits. Editor's evaluation This paper investigated the identification of causal risk factors on health outcomes. It applies sparse dimension reduction methods on highly correlated traits in the Mendelian randomization framework. The implementation of this method helps to identify risk factors when given high dimensional traits data. https://doi.org/10.7554/eLife.80063.sa0 Decision letter Reviews on Sciety eLife's review process Introduction Mendelian randomisation (MR) is an epidemiological study design that uses genetic variants as instrumental variables (IVs) to investigate the causal effect of a genetically predicted exposure on an outcome of interest (Smith and Ebrahim, 2003). In a randomised controlled trial (RCT) the act of randomly allocating patients to different treatment groups precludes the existence of systematic confounding between the treatment and outcome and therefore provides a strong basis for causal inference. Likewise, the alleles that determine a small proportion of variation of the exposure in MR are inherited randomly. We can therefore view the various genetically proxied levels of a lifelong modifiable exposure as a 'natural' RCT, avoiding the confounding that hinder traditional observational associations. Genetically predicted levels of an exposure are also less likely to be affected by reverse causation, as genetic variants are allocated before the onset of the outcomes of interest. When evidence suggests that multiple correlated phenotypes may contribute to a health outcome, multivariable MR (MVMR), an extension of the basic univariable approach can disentangle more complex causal mechanisms and shed light on mediating pathways. Following the analogy with RCTs, the MVMR design is equivalent to a factorial trial, in which patients are simultaneously randomised to different combinations of treatments (Burgess and Thompson, 2015). An example of this would be investigation into the effect of various lipid traits on coronary heart disease (CHD) risk (Burgess and Harshfield, 2016). While MVMR can model correlated exposures, it performs suboptimally when there are many highly correlated exposures due to multicollinearity in their genetically proxied values. This can be equivalently understood as a problem of conditionally weak instruments (Sanderson et al., 2019) that is only avoided if the genetic instruments are strongly associated with each exposure conditionally on all the other included exposures. An assessment of the extent to which this assumption is satisfied can be made using the conditional F-statistic, with a value of 10 for all exposures being considered sufficiently strong (Sanderson et al., 2019). In settings when multiple highly correlated exposures are analysed, a set of genetic instruments are much more likely to be conditionally weak instruments. In this event, causal estimates can be subject to extreme bias and are therefore unreliable. Estimation bias can be addressed to a degree by fitting weak instrument robust MVMR methods (Sanderson et al., 2020; Wang et al., 2021), but at the cost of a further reduction in precision. Furthermore, MVMR models investigate causal effects for each individual exposure, under the assumption that it is possible to intervene and change each one whilst holding the others fixed. In the high-dimensional, highly correlated exposure setting, this is potentially an unachievable intervention in practice. Our aim in this paper is instead to use dimensionality reduction approaches to concisely summarise a set of highly correlated genetically predicted exposures into a smaller set of independent principal components (PCs). We then perform MR directly on the PCs, thereby estimating their effect on health outcomes of interest. We additionally suggest employing sparsity methods to reduce the number of exposures that contribute to each PC, in order to improve their interpretability in the resulting factors. Using summary genetic data for multiple highly correlated lipid fractions and CHD (Kettunen et al., 2016; Nelson et al., 2017), we first illustrate the pitfalls encountered by the standard MVMR approach. We then apply a range of sparse principal component analysis (sPCA) methods within an MVMR framework to the data. Finally, we examine the comparative performance of the sPCA approaches in a detailed simulation study, in a bid to understand which ones perform best in this setting. Results Workflow overview Our proposed analysis strategy is presented in Figure 1. Using summary statistics for the single-nucleotide polymorphism (SNP)-exposure (γ^) and SNP-outcome (Γ^) association estimates, where γ^ (dimensionality 148 SNPs× 97 exposures) exhibits strong correlation, we initially perform a PCA on γ^. Additionally, we perform multiple sPCA modalities that aim to provide sparse loadings that are more interpretable (block 3, Figure 1). The choice of the number of PCs is guided by permutation testing or an eigenvalue threshold. Finally, the PCs are used in place of γ^ in an IVW MVMR meta-analysis to obtain an estimate of the causal effect of the PC on the outcome. Similar to PC regression and in line with unsupervised methods, the outcome (SNP-outcome associations (Γ^) and corresponding standard error (S⁢EΓ^)) is not transformed by PCA and is used in the second-step MVMR in the original scale. In the real data application and in the simulation study, the best balance of sparsity and statistical power was observed for the method of sparse component analysis (SCA) (Chen and Rohe, 2021). This favoured method and the related steps are coded in an R function and are available at GitHub (https://github.com/vaskarageorg/SCA_MR/, copy archived at Karageorgiou, 2023). Figure 1 Download asset Open asset Proposed workflow. Step 1: MVMR on a set of highly correlated exposures. Each genetic variant contributes to each exposure. The high correlation is visualised in the similarity of the single-nucleotide polymorphism (SNP)-exposure associations in the correlation heatmap (top right). Steps 2 and 3: PCA and sparse PCA on γ^. Step 4. MVMR analysis on a low dimensional set of principal components (PCs). X: exposures; Y: outcome; k: number of exposures; PCA: principal component analysis; MVMR: multivariable Mendelian randomisation. UVMR and MVMR A total of 66 traits were associated with CHD at or below the Bonferroni-corrected level (p=0.05/97, Table 1). Two genetically predicted lipid exposures (M.HDL.C, M.HDL.CE) were negatively associated with CHD and 64 were positively associated (Table 3). In an MVMR model including only the 66 Bonferroni-significant traits, fitted with the purpose of illustrating the instability of IVW-MVMR in conditions of severe collinearity, conditional F-statistic (CFS) (Materials and methods) was lower than 2.2 for all exposures (with a mean of 0.81), highlighting the severe weak instrument problem. In Appendix 1—figure 3, the MVMR estimates are plotted against the corresponding univariable MR (UVMR) estimates. We interpret the reduction in identified effects as a result of the drop in precision in the MVMR model (variance inflation). Only the independent causal estimate for ApoB reached our pre-defined significance threshold and was less precise (ORMVMR (95% CI): 1.031⁢(1.012,1.37), ORUVMR (95% CI): 1.013⁢(1.01,1.016) (Appendix 1—figure 4). We note that, for M.LDL.PL, the UVMR estimate (1.52⁢(1.35,1.71), p < 10-10)) had an opposite sign to the MVMR estimate (ORMVMR=0.905(0.818,1.001)). To see if the application of a weak instrument robust MVMR method could improve the analysis, we applied MR GRAPPLE (Wang et al., 2021). As the GRAPPLE pipeline suggests, the same three-sample MR design described above is employed. In the external selection GWAS study (GLGC), a total of 148 SNPs surpass the genome-wide significance level for the 97 exposures and were used as instruments. Although the method did not identify any of the exposures as significant at nominal or Bonferroni-adjusted significance level, the strongest association among all exposures is ApoB. Table 1 Univariable Mendelian randomisation (MR) results for the Kettunen dataset with coronary heart disease (CHD) as the outcome. Positive: positive causal effect on CHD risk; Negative: negative causal effect on CHD risk. PositiveNegativeVLDLAM.VLDL.C, M.VLDL.CE, M.VLDL.FC, M.VLDL.L,M.VLDL.P, M.VLDL.PL, M.VLDL.TG, XL.VLDL.L,XL.VLDL.PL, XL.VLDL.TG, XS.VLDL.L, XS.VLDL.P, XS.VLDL.PL,XS.VLDL.TG, XXL.VLDL.L, XXL.VLDL.PL,L.VLDL.C, L.VLDL.CE, L.VLDL.FC, L.VLDL.L, L.VLDL.P,L.VLDL.PL, L.VLDL.TG, SVLDL.C, S.VLDL.FC,S.VLDL.L, S.VLDL.P, S.VLDL.PL, S.VLDL.TGNoneLDLALDL.C, L.LDL.C, L.LDL.CE, L.LDL.FC, L.LDL.L, L.LDL.P, L.LDL.PL,M.LDL.C, M.LDL.CE, M.LDL.L, M.LDL.P,M.LDL.PL, S.LDL.C, S.LDL.L, S.LDL.PNoneHDLS.HDL.TG, XL.HDL.TGM.HDL.C, M.HDL.CE PCA Standard PCA with no sparsity constraints was used as a benchmark. PCA estimates a square loadings matrix of coefficients with dimension equal to the number of genetically proxied exposures K. The coefficients in the first column define the linear combination of exposures with the largest variability (PC1). Column 2 defines PC2, the linear combination of exposures with the largest variability that is also independent of PC1, and so on. This way, the resulting factors seek to reduce redundant information and project highly correlated SNP-exposure associations to the same PC. In PC1, very low-density lipoprotein (VLDL)- and low-density lipoprotein (LDL)-related traits were the major contributors (Figure 2a). ApoB received the 8th largest loading (0.1371, maximum was 0.1403 for cholesterol content in small VLDL) and LDL.C received the 48th largest (0.1147). In PC2, high-density lipoprotein (HDL)-related traits were predominant. The first 18 largest positive loadings are HDL-related and 12 describe either large or extra-large HDL traits. PC3 received its scores mainly from VLDL traits. Six components were deemed significant through the permutation-based approach (Figure 1, Materials and methods). Figure 2 Download asset Open asset Heatmaps for the loadings matrices in the Kettunen dataset for all methods (one with no sparsity constraints [a], four with sparsity constraints under different assumptions [b–e]). The number of the exposures plotted on the vertical axis is smaller than K=97 as the exposures that do not contribute to any of the sparse principal components (PCs) have been left out. Blue: positive loading; red: negative loading; yellow: zero. In the second-step IVW regression (step 4 in Figure 1), MVMR results are presented. A modest yet precise (OR = 1.002⁢(1.0015,1.0024), p<10−10) association of PC1 with CHD was observed. Conversely, PC3 was marginally significant for CHD at the 5% level (OR = 0.998 (0.998, 0.999), p=0.049). Since γ^ has been transformed with linear coefficients (visualised in loadings matrix, Figure 2), the underlying causal effects are also transformed and interpreting the magnitude of an effect estimate is not straightforward, since it reflects the effect of changing the PC by one unit on the outcome; however, significance and orientation of effects can be interpreted. When positive loadings are applied to exposures that are positively associated with the outcome, the MR estimate is positive; conversely, if negative loadings are applied, the MR estimate is negative. sPCA methods We next employed multiple sPCA methods (Table 2) that each shrink a proportion of loadings to zero. The way this is achieved differs in each method. Their underlying assumptions and details on differences in optimisation are presented in Table 2 and further described in Materials and methods. Table 2 Overview of sparse principal component analysis (sPCA) methods used. KSS: Karlis-Saporta-Spinaki criterion. Package: R package implementation; Features: short description of the method; Choice: method of selection of the number of informative components in real data; PCs: number of informative PCs. MethodPackageAuthorsFeaturesChoicePCsRSPCApcaPPCroux et al., 2013Robust sPCA (RSPCA), different measure of dispersion (Qn)Permutation KSS6SFPCACode in publication, Supplementary MaterialGuo et al., 2010Fused penalties for block correlationKSS6sPCAelasticnetZou et al., 2006Formulation of sPCA as a regression problemKSS6SCASCAChen and Rohe, 2021Rotation of eigen vectors for approximate sparsityPermutation KSS6 RSPCA (Croux et al., 2013) Optimisation and the KSS criterion pick six PCs to be informative (Karlis et al., 2003). The loadings in Figure 2 show a VLDL-, LDL-dominant PC1, with some small and medium HDL-related traits. LDL.C and ApoB received the 5th and 40th largest positive loadings. PCs 1 and 6 are positively associated with CHD and PCs 3 and 5 negatively so (Appendix 1—table 1). SFPCA (Guo et al., 2010) The KSS criterion retains six PCs. The loadings matrix (Figure 2) shows the 'fused' loadings with the identical colouring. In the two first PCs, all groups are represented. Both ApoB and LDL.C received the seventh and tenth largest loadings, together with other metabolites (Figure 2). PC1 (all groups represented) was positively associated with CHD and PC4 (negative loadings from large HDL traits) negatively so (Appendix 1—table 1). sPCA (Zou et al., 2006) The number of non-zero metabolites per PC was set at 14897∼16 (see Appendix 1—figure 6). Under this level of sparsity, the permutation-based approach suggested that six sPCs should be retained. Seventy exposures received a zero loading across all components. PC1 is constructed predominantly from LDL traits and is positively associated with CHD, but this does not retain statistical significance at the nominal level in MVMR analysis (Figure 3). Only PC4 that is comprised of small and medium HDL traits (Figure 2b) appears to exert a negative causal effect on CHD (OR (95% CI): 0.9975⁢(0.9955,0.9995)). The other PCs were not associated with CHD (all p values > 0.05, Appendix 1—table 1). Figure 3 Download asset Open asset Comparison of univariable Mendelian randomisation (UVMR) and multivariable MR (MVMR) estimates and presentation of the major group represented in each principal component (PC) per method. SCA (Chen and Rohe, 2021) Six components were retained after a permutation test. In the final model, five metabolites were regularised to zero in all PCs (CH2.DB.ratio, CH2.in.FA, FAw6, S.VLDL.C, S.VLDL.FC, Figure 2). Little overlap is noted among the metabolites. PC1 receives loadings from LDL and IDL, and PC2 from VLDL. The contribution of HDL to PCs is split in two, with large and extra-large HDL traits contributing to PC3 and small and medium ones to PC4. PC1 and PC2 were positively associated with CHD (Appendix 1—table 1, Figure 3). PC4 was negatively associated with CHD. Comparison with UVMR In principle, all PC methods derive independent components. This is strictly the case in standard PCA, where subsequent PCs are perfectly orthogonal, but is only approximately true in sparse implementations. We hypothesised that UVMR and MVMR could provide similar causal estimates of the associations of metabolite PCs with CHD. The results are presented in Figure 3 and concordance between UVMR and MVMR is quantified with the R2 from a linear regression. The largest agreement of the causal estimates is observed in PCA. In the sparse methods, SCA (Chen and Rohe, 2021) and sPCA (Zou et al., 2006) provide similarly consistent estimates, whereas some disagreement is observed in the estimate of PC6 for RSPCA (Croux et al., 2013) on CHD. A previous study implicated LDL.c and ApoB as causal for CHD (Zuber et al., 2020b). In Appendix 1—figure 7, we present the loadings for these two exposures across the PCs for the various methods. Ideally, we would like to see metabolites contributing to a small number of components for the sparse methods. Using a visualisation technique proposed by Kim and Kim, 2012, this is indeed observed (see Appendix 1—figure 7). In PCA, LDL.c and ApoB contribute to multiple PCs, whereas the sPCA methods limit this to one PC. Only in RSPCA do these exposures contribute to two PCs. In the second-step IVW meta-analysis, it appears that the PCs comprising of predominantly VLDL/LDL and HDL traits robustly associate with CHD, with differences among methods (Table 3). Table 3 Results for principal component analysis (PCA) approaches. Overlap: Percentage of metabolites receiving non-zero loadings in ≥1 component. Overlap in PC1, PC2: overlap as above but exclusively for the first two components which by definition explain the largest proportion of variance. Very low-density lipoprotein (VLDL), low-density lipoprotein (LDL), and high-density lipoprotein (HDL) significance: results of the IVW regression model with CHD as the outcome for the respective sPCs (the sPCs that mostly received loadings from these groups). The terms VLDL and LDL refer to the respective transformed blocks of correlated exposures; for instance, VLDL refers to the weighted sum of the correlated VLDL-related γ^ associations, such as VLDL phospholipid content and VLDL triglyceride content. †: RSPCA projected VLDL- and LDL-related traits to the same PC (sPC1). ‡: SCA discriminated HDL molecules in two sPCs, one for traits of small- and medium-sized molecules and one for large- and extra-large-sized. PCARSPCASFPCAsPCASCAOverlap10.93810.1870.196Overlap in PC1,PC210.43310.0100Sparse %00.4740.0820.8350.796VLDL significance in MR†YesNoYesNoYesLDL significance in MRNoYesNoNoYesHDL significance in MR‡YesYesYesNoNoSmall, medium HDL significance in MRYesNoYesYesYes Instrument strength Instrument strength for the chosen PCs was assessed via an F-statistic, calculated using a bespoke formula that accounts for the PC process (see Materials and methods and Appendix). The F-statistics for all transformed exposures cross the cutoff of 10. There was a trend for the first components being more strongly instrumented in all methods (see Appendix 1—figure 5), which is to be expected. In the MVMR analyses, the CFS for all exposures was less than three. Thus the move to PC-based analysis significantly improved instrument strength and mitigated against weak instrument bias. Simulation studies We consider the case of a data that reflects in we consider a set of exposures which can be into blocks based on groups of variants contribute exclusively to blocks of exposures, no effect on other This in to correlation among the exposure blocks and a much correlation of between exposure due only to This is visualised in Figure This data to reduce the strength in all exposures. The dataset consists of exposures, p SNPs (with and p of and a outcome, We split the simulation results into one example and one Figure 4 Download asset Open asset Simulation Data for the simulation study, with six exposures and two In the exposures that are correlated due to a genetic component are Simulation results for six exposures and three methods component analysis and Rohe, principal component analysis multivariable Mendelian randomisation The exposures that contribute to are presented in of and that do not in of In the each exposure is a In the first and the PCs that to these exposures are presented as in and are visualised as error proportion of where the is example We data under the presented in Figure with six individual exposures split into two blocks and A outcome is that is only affected by the exposures in block 1 A range of were used in the simulation in order to a range of CFS values from approximately We apply MVMR with the six individual exposures and PCA and The aim of approach is to the of the exposure into two PCs, so that the first PC has high loadings for block 1 and the PC has high loadings for block 2 Although two PCs were chosen by PCA methods using a KSS criterion in a large of to the simulation we a the number of PCs at two across all Our was to the of MVMR PCA than as the two approaches are not in this To do this we each method as a which true positive true negative positive and negative In a is an exposure that is causal in the underlying model and causal estimate is deemed In the PCA and sPCA methods, this is with to which determine each PC and if the causal estimate of this PC is are considered to be major contributors to a PC if only their individual PC loading is than the the causal effect estimate of a PC in the analysis deemed major contributors that are causal and are as and and are error therefore to the and power to the statistical were at the = = PCA, and MVMR error and power are in the three to in Figure results suggest an improved power in true causal associations with PCA and SCA with MVMR when the CFS is at the cost of an error As and CFS MVMR performs the PC of the exposures, PCA to have a error in Figure In this setting, the of PCA therefore appears to be example The aim of the simulation is to estimate the comparative performance of the methods in a that more real data We genetic data and individual level exposure and outcome data for between exposures, in The underlying data and the process of method performance is identical to the but the number of exposures, and the blocks is We results across all by and and then all methods by their under the using the approach of et al., the method performs a meta-analysis of multiple studies that and of a in order to provide a summary A model is and estimates are In our the the results of different simulation settings with of exposures and was also with high values being of Two sPCA methods and Rohe, sPCA et al., the (Figure This is mainly by an in for these methods with A at the individual simulation results the of these two methods, as high (Appendix 1—figure Both standard and Bonferroni-corrected MVMR in terms of and due to PCA with equal and results PCA and RSPCA did not identify negative results and RSPCA and This extreme result can be understood by at the individual simulation results in Appendix 1—figure PCA and RSPCA to the of the a low performance in exposures. the estimates with these methods were very precise across and this in many results and low We note a performance among the methods methods are on the results of SCA are more variable in and (Table 4). The for these methods are also the (Figure the instrument strength in γ^ from to and mean conditional F-statistic (Appendix 1—figure suggests a similar for sparse methods. Figure 5 Download asset Open asset for all methods. sparse component analysis (Chen and Rohe, 2021) sparse PCA (Zou et al., 2006) robust sparse PCA (Croux et al., PCA: principal component analysis; MVMR: multivariable Mendelian MVMR with with large MVMR can not between positive and negative exposures as robustly as the sPCA methods. A major of the of these methods appears to be the number of causal exposures, as in a simulation with only four of the exposures being there was a drop in and across all methods. sPCA methods other methods in this (Appendix 1—table 2). Table 4 and presented as and range across all as and range across all under the PCA In the example of Figure 4 and indeed any other if two PCs are PCA between causal and exposures. The only information used in this of the 2 and 3 in Figure is the association the of to PCs is genetic correlation and correlation due to than these blocks if only a of the exposures it is likely that, PCA identify the block as This the proportion of exposures within blocks of exposures that is a of To we the proportion of exposures by the sparsity of the causal effect

  • Research Article
  • Cite Count Icon 1
  • 10.62411/jais.v9i1.9923
Utilization Of Principal Component Analysis To Improve Emotion Classification Performance In Text Using Artificial Neural Networks
  • Apr 21, 2025
  • Journal of Applied Intelligent System
  • Mahazam Afrad + 2 more

Emotions, being transient and variable, differ across locations, times, and individuals. Automatic emotion identification holds significant importance across various domains, such as education and business. In education, emotional analysis contributes to intelligent electronic learning environments, while in business, it aids in assessing customer satisfaction with products. This study advocates the application of Principal Component Analysis (PCA) to enhance the performance of text emotion classification using the Artificial Neural Network (ANN) method. PCA, a pattern identification method, reduces text dimensions, improving the classification process by determining word similarities. PCA offers the advantage of dimension reduction without compromising information integrity. The classification approach involves two stages: one after PCA dimension reduction and the other without PCA post TF-IDF stage. The study's conclusive findings, incorporating PCA in ANN classification, demonstrated a notable increase in recall for the happy class, reaching 0.92 compared to the pre-PCA score of 0.91. Furthermore, precision in the sadness class improved to 0.90, surpassing the pre-PCA precision of 0.80. This affirms the efficacy of integrating PCA in enhancing the accuracy and performance of emotion classification in text analysis.

Save Icon
Up Arrow
Open/Close