Abstract
Background: Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs is correlated with a particular complex trait and is important to the prediction of that trait. Efficiently and accurately selecting these influential SNPs from millions of candidates is in high demand but poses challenges. We propose a backward elimination iterative distance correlation (BE-IDC) procedure to select the smallest subset of SNPs that guarantees sufficient prediction accuracy, while also resolving the unclear-threshold issue of traditional feature screening approaches.
Results: In six simulations, the adaptive threshold estimated by BE-IDC performed uniformly better than the fixed-threshold methods used in the current literature. We also applied BE-IDC to a genome-wide Arabidopsis thaliana data set. Out of 216,130 SNPs, BE-IDC selected four influential SNPs and confirmed the same FRIGIDA gene that was reported by two other traditional methods.
Conclusions: BE-IDC delivers both the prediction accuracy and the computational speed that genomic selection demands.
Highlights
Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential single nucleotide polymorphisms (SNPs) is correlated with a particular complex trait and is important to the prediction of that trait
We demonstrate that the backward elimination iterative distance correlation (BE-IDC) approach selects a very small set of SNPs for Arabidopsis thaliana data
Unless a very small number of SNPs is preferred in order to save experimental cost in breeding or disease-diagnosis applications, we suggest taking the threshold to be the one that minimizes the mean square prediction error (MSPE)
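The MSPE-based threshold choice in the last highlight can be illustrated with a minimal sketch. It is not the BE-IDC procedure itself: it assumes the SNPs have already been ranked, uses an ordinary least-squares fit as a stand-in predictor, and scores each candidate subset size on a held-out validation split. The function names `mspe_for_subset` and `pick_size_by_mspe` and the candidate list `sizes` are hypothetical.

```python
import numpy as np

def mspe_for_subset(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares on the training split using only `cols`,
    then return the mean square prediction error on the validation split."""
    A_tr = np.column_stack([np.ones(len(y_tr)), X_tr[:, cols]])  # add intercept
    A_va = np.column_stack([np.ones(len(y_va)), X_va[:, cols]])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    resid = y_va - A_va @ beta
    return float(np.mean(resid ** 2))

def pick_size_by_mspe(X_tr, y_tr, X_va, y_va, ranked, sizes):
    """Assumption: `ranked` lists SNP column indices from most to least
    important. For each candidate size k, keep the top-k SNPs and record
    the validation MSPE; return the size with the smallest MSPE."""
    errors = {
        k: mspe_for_subset(X_tr, y_tr, X_va, y_va, list(ranked[:k]))
        for k in sizes
    }
    best = min(errors, key=errors.get)
    return best, errors
```

If a smaller panel is preferred for cost reasons, the `errors` dictionary can be inspected to find the smallest size whose MSPE is still acceptable instead of taking the strict minimum.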
Summary
Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential single nucleotide polymorphisms (SNPs) is correlated with a particular complex trait and is important to the prediction of that trait. Genomic selection is improved by identifying a small subset of influential SNPs from high-dimensional genetic information to efficiently predict an individual's phenotype [1,2,3,4,5]. Li et al. developed a distance-correlation-based sure independence screening (DC-SIS) strategy that defines an association strength measure for each feature based on its distance correlation with the phenotype [16]. DC-SIS theoretically satisfies the sure screening property, ranks the features from most to least important by decreasing distance correlation, and filters out the majority of noise features, which have low values of the association strength measure.
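The ranking step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it computes the standard sample distance correlation from double-centered pairwise distance matrices and sorts the SNP columns by that score. The names `distance_correlation` and `dc_sis_rank` are hypothetical, and the code assumes a numeric genotype matrix `X` (samples by SNPs) and a numeric phenotype vector `y`.

```python
import numpy as np

def _double_centered_distances(v):
    """Pairwise Euclidean distance matrix of the samples in `v`,
    with row means and column means removed and the grand mean added back."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation between x and y (a value in [0, 1])."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()                        # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:                                # a constant input carries no signal
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

def dc_sis_rank(X, y):
    """Score every column of X against y and return the column indices
    sorted from highest to lowest distance correlation, plus the scores."""
    scores = np.array(
        [distance_correlation(X[:, j], y) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores)
    return order, scores
```

A screening step would then keep only the leading entries of `order`. How many to keep is exactly the threshold question that a fixed cutoff leaves open. Note that each score builds an n-by-n distance matrix, so this sketch suits modest sample sizes.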