Abstract

The local environment of the geographical origin of plants has shaped their genetic variation through environmental adaptation. While the characteristics of the local environment correlate with the genotypes and other genomic features of the plants, they can also be indicative of genotype-phenotype associations, providing additional information relevant to environmental dependence. In this study, we investigate how the geoclimatic features from the geographical origin of Arabidopsis thaliana accessions can be integrated with genomic features for phenotype prediction and association analysis using advanced canonical correlation analysis (CCA). In particular, we propose a novel method called hierarchical canonical correlation analysis (HCCA) to combine mutations, gene expressions, and DNA methylations with geoclimatic features for informative coprojections of the features. HCCA uses the condition number of the cross-covariance between pairs of datasets to infer a hierarchical structure for applying CCA to combine the data. In experiments on Arabidopsis thaliana data from the 1001 Genomes and 1001 Epigenomes projects, together with climatic, atmospheric, and soil environmental variables combined by CLIMtools, HCCA provided a joint representation of the genomic and geoclimate data for better prediction of the flowering time at 10°C (FT10) of Arabidopsis thaliana. We also extended HCCA with information from a protein-protein interaction (PPI) network to guide the feature learning by imposing network modules onto the genomic features, which are shown to be useful for identifying genes with more coherent functions correlated with the geoclimatic features. The findings in this study suggest that environmental data comprise an important component in plant phenotype analysis. HCCA is a useful data integration technique for phenotype prediction and for a better understanding of the interactions between gene functions and environment, as more useful functional information is introduced by coprojections of multiple genomic datasets.
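
The role of the condition number can be illustrated with a minimal sketch. It is not the authors' implementation: the dataset names, the random placeholder matrices, and the helper functions `cross_covariance` and `pairwise_condition_numbers` are all assumptions made for illustration. The sketch only computes the condition number of the sample cross-covariance for every pair of datasets and sorts the pairs by it. How HCCA turns such a ranking into a hierarchy is not stated in the abstract, so no merge rule is implemented here.

```python
import itertools

import numpy as np


def cross_covariance(X, Y):
    """Sample cross-covariance between the columns of X (n x p) and Y (n x q)."""
    Xc = X - X.mean(axis=0)  # center each feature
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (X.shape[0] - 1)


def pairwise_condition_numbers(datasets):
    """Map each unordered pair of dataset names to the 2-norm condition
    number (largest / smallest singular value) of their cross-covariance."""
    result = {}
    for (a, Xa), (b, Xb) in itertools.combinations(datasets.items(), 2):
        result[(a, b)] = float(np.linalg.cond(cross_covariance(Xa, Xb)))
    return result


# Hypothetical placeholder data: 50 accessions, a few features per view.
rng = np.random.default_rng(0)
n_accessions = 50
datasets = {
    "expression": rng.normal(size=(n_accessions, 4)),
    "methylation": rng.normal(size=(n_accessions, 4)),
    "geoclimate": rng.normal(size=(n_accessions, 3)),
}

conds = pairwise_condition_numbers(datasets)
# Pairs sorted by condition number, smallest first (sort order is an
# illustrative choice, not a rule taken from the source).
ranking = sorted(conds, key=conds.get)
for pair in ranking:
    print(pair, round(conds[pair], 2))
```

With real data, the feature matrices would be restricted to the accessions shared by all datasets before computing the cross-covariances, since each row must refer to the same accession in both views.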

Highlights

  • With the new high-throughput genomic array and sequencing technologies, large-scale genomic datasets in Arabidopsis thaliana have been greatly augmented or first become available in the last few years

  • To integrate the multiple genomic datasets and geoclimate dataset, we propose a novel method, namely, hierarchical canonical correlation analysis (HCCA), which integrates pairs of datasets hierarchically using the concept of condition number of the cross-covariance between a pair of datasets to determine the hierarchy

  • We compare HCCA with several other baseline methods to predict the flowering time at 10°C (FT10) phenotype collected from AraPheno

Summary

Introduction

With the new high-throughput genomic array and sequencing technologies, large-scale genomic datasets in Arabidopsis thaliana have been greatly augmented or first become available in the last few years. The early large-scale genomic study in [1] used a 250k SNP chip with multiple markers in each haplotype block to genotype a regional map (RegMap) of 1,307 accessions and conducted a global comparison of the traits in genetically distinct groups. The 1001 Epigenomes project [5] presented a larger, comprehensive resource with 1,107 single-base resolution methylomes of 1,028 accessions and 1,203 transcriptomes of 998 accessions. These resources have enabled more studies, leading to a better understanding of how the variations contribute to the molecular and nonmolecular phenotypes of Arabidopsis thaliana.
