Abstract
The least absolute shrinkage and selection operator (lasso) and principal component regression (PCR) are popular methods for estimating traits from high-dimensional omics data, such as transcriptomes. The prediction accuracy of these methods depends strongly on the covariance structure of the data, which is characterized by gene regulation networks. However, how the structure of a gene regulation network, together with the sample size, affects prediction accuracy has not yet been sufficiently investigated. In this study, Monte Carlo simulations are conducted to investigate the prediction accuracy for several network structures under various sample sizes. When the gene regulation network is a random graph, a sufficiently large number of observations is required to ensure good prediction accuracy with the lasso, and PCR provides poor prediction accuracy regardless of the sample size. However, a real gene regulation network is likely to exhibit a scale-free structure. In such cases, the simulation indicates that a relatively small number of observations, such as N = 300, is sufficient to allow the accurate prediction of traits from a transcriptome with the lasso.
Highlights
The least absolute shrinkage and selection operator (lasso) and principal component regression (PCR) are popular methods for estimating traits from high-dimensional omics data, such as transcriptomes
We compared the performance of the lasso with that of the PCR
In a gene regulation network, a gene regulates a small portion of a genome, not all the genes in a genome
Summary
The least absolute shrinkage and selection operator (lasso) and principal component regression (PCR) are popular methods for estimating traits from high-dimensional omics data, such as transcriptomes. If one is to apply least-squares estimation in multiple regression (e.g. trait ≈ β0 + β1·gene1 + β2·gene2 + ⋯) to predict a trait value from a transcriptome, the sample size needs to be (at least) larger than the number of model parameters. When a random vector of explanatory variables follows a multivariate normal distribution, two variables are conditionally independent if and only if the corresponding element of the inverse covariance matrix is zero.
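The link between conditional independence and the inverse covariance matrix can be illustrated with a toy example. The sketch below is not taken from the study: it assumes a hypothetical three-gene chain (gene1 - gene2 - gene3) with made-up inverse covariance values, and the helper name `invert` is introduced here only for illustration. It shows that gene1 and gene3 have a zero entry in the inverse covariance matrix, so they are conditionally independent given gene2 under multivariate normality, while their ordinary covariance is still nonzero.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix exactly using Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick a row with a nonzero entry in this column and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot entry becomes 1.
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


# Hypothetical inverse covariance (precision) matrix for a chain
# gene1 - gene2 - gene3. The values are illustrative assumptions.
# The (gene1, gene3) entry is zero: no direct edge between them.
precision = [
    [2, -1, 0],
    [-1, 2, -1],
    [0, -1, 2],
]

covariance = invert(precision)

for row in covariance:
    print([str(x) for x in row])

# gene1 and gene3 are marginally correlated through gene2 ...
print("cov(gene1, gene3) =", covariance[0][2])
# ... but the corresponding precision entry is zero.
print("precision(gene1, gene3) =", precision[0][2])
```

In this picture, the zero pattern of the inverse covariance matrix encodes which genes are directly connected in the network, whereas the covariance matrix itself is generally dense.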