How complex phenotypes emerge from intricate gene expression patterns is a fundamental question in biology. Quantitative characterization of this relationship, however, is challenging due to the vast combinatorial possibilities and dynamic interplay between genotype and phenotype landscapes. Integrating high-content genotyping approaches such as single-cell RNA sequencing with advanced learning methods such as language models offers an opportunity to dissect this complex relationship. Here, we present a computational integrated genetics framework designed to analyze and interpret the high-dimensional landscapes of genotypes and their associated phenotypes simultaneously. We applied this approach to develop a multimodal foundation model to explore the genotype-phenotype relationship manifold for human transcriptomics at the cellular level. Analysis of this joint manifold refined the resolution of cellular heterogeneity, enhanced the precision of phenotype annotation, and uncovered potential cross-tissue biomarkers that are undetectable through conventional gene expression analysis alone. Moreover, our results revealed that gene networks are characterized by scale-free patterns and show context-dependent gene-gene interactions, both of which result in significant variations in network topology, particularly evident during aging. Finally, utilizing contextualized embeddings, we investigated gene polyfunctionality, which illustrates the multifaceted roles that genes play in different biological processes, and demonstrated it for the VWF gene in endothelial cells. Overall, this study advances our understanding of the dynamic interplay between gene expression and phenotypic manifestation and demonstrates the potential of integrated genetics in uncovering new dimensions of cellular function and complexity.