Abstract

Advances in technology have made the analysis of large-scale gene expression data feasible, and such analysis has become popular in the era of machine learning. This paper develops an improved ridge approach for genome regression modeling. When the data set exhibits multicollinearity and contains outliers, we use a robust ridge estimator, namely the rank ridge regression estimator, for parameter estimation and prediction. The efficiency of the rank ridge regression estimator, however, depends heavily on the ridge parameter, and in general there is no fully satisfactory rule for selecting it. Because generalized cross validation (GCV) is simple and has good properties, we use it to choose the optimal value of the ridge parameter. The GCV function balances the precision of the estimators against the bias introduced by ridge estimation. It behaves like an improved estimator of risk and can be used in high-dimensional problems where the number of explanatory variables exceeds the sample size. Finally, numerical illustrations are given to support our findings.

Introduction

High-dimensional statistical inference is essential whenever the number of unknown parameters exceeds the sample size. High-throughput technology provides large-scale gene expression data in transcriptomics. The riboflavin production data set for Bacillus subtilis (Lee et al. [1] and Zamboni et al. [2]) has the logarithm of the riboflavin production rate as the response variable and 4088 covariates, which are the logarithms of the expression levels of 4088 genes, normalized using the normalization methods for Affymetrix oligonucleotide arrays. The data set is rather homogeneous and comprises 71 samples that were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains, and strains grown under different fermentation conditions, were analyzed. Sparse linear regression models are a relevant family of methods for predicting the response from such high-dimensional gene expression data.
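
To make the GCV-based choice of the ridge parameter described in the abstract more concrete, the following is a minimal sketch under stated assumptions. It uses the ordinary least-squares ridge estimator, not the paper's rank ridge estimator, and simulated data rather than the riboflavin data set; the function name `ridge_gcv`, the candidate grid `ks`, and the simulation settings are all hypothetical. The criterion computed is the standard form GCV(k) = n ||(I - H(k)) y||^2 / [tr(I - H(k))]^2, where H(k) is the ridge hat matrix.

```python
import numpy as np


def ridge_gcv(X, y, ks):
    """Return (best_k, scores): GCV scores for ordinary ridge over a grid ks.

    Assumes X and y are already centered. This is least-squares ridge,
    used here as a stand-in for the rank ridge estimator.
    """
    n = X.shape[0]
    # Thin SVD X = U diag(s) V^T; this also works when p > n.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    scores = []
    for k in ks:
        # Eigenvalues of the hat matrix H(k) = X (X^T X + k I)^{-1} X^T.
        shrink = s**2 / (s**2 + k)
        resid = y - U @ (shrink * Uty)      # (I - H(k)) y
        df_resid = n - shrink.sum()         # tr(I - H(k))
        scores.append(n * (resid @ resid) / df_resid**2)
    scores = np.array(scores)
    return ks[int(np.argmin(scores))], scores


# Simulated high-dimensional example (hypothetical settings): n = 71 samples,
# p = 500 covariates, only the first 5 covariates carry signal.
rng = np.random.default_rng(0)
n, p = 71, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=n)

# Center the data so no intercept is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

ks = np.logspace(-2, 4, 25)                 # candidate ridge parameters
best_k, scores = ridge_gcv(X, y, ks)
print("ridge parameter minimizing GCV:", best_k)
```

Working through the singular value decomposition avoids forming the p x p matrix X^T X, which keeps the computation cheap when p is much larger than n, as in the riboflavin setting with 71 samples and 4088 genes.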
