Abstract

One of the most important tasks in genome-wide association studies (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) that are related to target traits. With the development of sequencing technology, traditional statistical methods struggle to analyze the resulting massive, high-dimensional SNP data. Recently, machine learning methods have become more popular in high-dimensional genetic data analysis because of their fast computation speed. However, most machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification and low detection accuracy. This study proposed a two-stage algorithm based on least angle regression and random forest (TSLRF). The algorithm first controls for population structure and polygenic effects, then uses least angle regression (LARS) to select the SNPs that are potentially related to target traits, and finally analyzes this variable subset with random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method showed higher detection power in both simulation experiments and real data analyses. In the simulation experiments, compared with the existing approaches, the new method improved both QTN detection ability and model fit, and required less computation time. In addition, the new method clearly distinguished QTNs from other SNPs. Subsequently, the new method was applied to analyze five flowering-related traits in Arabidopsis. The results showed that the distinction between QTNs and unrelated SNPs was more pronounced than with the other methods. The new method detected 60 genes confirmed to be related to the target trait, significantly more than the other methods, and simultaneously detected multiple gene clusters associated with the target trait.
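The two-stage idea (LARS to screen SNPs, then RF on the reduced subset) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn's `Lars` and `RandomForestRegressor`, omits the correction for population structure and polygenic effects that TSLRF applies first, and uses hypothetical names and settings (`two_stage_screen`, `simulate`, `n_candidates`, `n_trees`) chosen only for illustration. The data are simulated, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lars


def two_stage_screen(genotypes, phenotype, n_candidates=20, n_trees=200, seed=0):
    """Screen SNPs with LARS, then rank the survivors by RF importance.

    Assumption: the candidate cap and the importance-based ranking are
    illustrative choices, not the selection criteria used by TSLRF.
    """
    # Stage 1: LARS keeps at most n_candidates SNPs with nonzero coefficients.
    lars = Lars(n_nonzero_coefs=n_candidates)
    lars.fit(genotypes, phenotype)
    candidates = np.flatnonzero(lars.coef_)

    # Stage 2: fit a random forest on the reduced SNP subset only.
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(genotypes[:, candidates], phenotype)

    # Sort candidate SNPs by decreasing importance.
    order = np.argsort(forest.feature_importances_)[::-1]
    return candidates[order], forest.feature_importances_[order]


def simulate(n_samples=200, n_snps=500, causal=(10, 250, 400), seed=0):
    """Toy data: genotypes coded 0/1/2 and three simulated causal SNPs."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    effects = np.array([2.5, -2.0, 1.5])  # hypothetical effect sizes
    noise = rng.normal(0.0, 1.0, n_samples)
    phenotype = genotypes[:, list(causal)] @ effects + noise
    return genotypes, phenotype


if __name__ == "__main__":
    genotypes, phenotype = simulate()
    ranked, importances = two_stage_screen(genotypes, phenotype)
    for snp, score in zip(ranked[:5], importances[:5]):
        print(f"SNP {snp}: importance {score:.3f}")
```

Screening first means the forest is trained on a few dozen columns rather than the full marker set, which is where a two-stage design would be expected to save computation time.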

Highlights

  • With the rapid development of biotechnology and sequencing technology, a large number of high-dimensional genetic data have been generated

  • We proposed a two-stage association analysis method that combines variable selection with machine learning: the two-stage algorithm based on least angle regression and random forest (TSLRF)

  • Each of the 1,000 replications was analyzed by TSLRF, two-stage stepwise variable selection based on random forest[30] (TSRF), RF, support vector regression (SVR), artificial neural network (ANN) and EMMAX

Summary

Introduction

With the rapid development of biotechnology and sequencing technology, large amounts of high-dimensional genetic data have been generated. How to analyze such datasets is an active research topic. Machine learning methods offer an alternative to classical statistical approaches for mining “big P, small N” genetic datasets by optimizing the classification. In current real data analysis, genome-wide single-marker scans that control for polygenic background and population structure are widely used to conduct GWAS, such as efficient mixed model association (EMMA)[25] and its improved version, EMMA eXpedited (EMMAX)[26], which reduces the computational time for analyzing large GWAS datasets from years to hours. Multi-locus GWAS methods have also been proposed, such as fast multi-locus random-SNP-effect EMMA (FASTmrEMMA)[16], which is more powerful in QTN detection and model fit
