Abstract
Genomes encode elaborate networks of genes whose products must seamlessly interact to support living organisms. Humans' capacity to understand these biological systems is limited by their sheer size and complexity. In this article, we develop a proof-of-concept framework for training a machine learning (ML) algorithm to model bacterial genome composition. To achieve this, we create simplified representations of genomes in the form of binary vectors that indicate the encoded genes, henceforth referred to as genome vectors. A denoising variational autoencoder was trained to accept corrupted genome vectors, in which most genes had been masked, and reconstruct the original. The resulting model, DeepGenomeVector, effectively captures complex dependencies in genomic networks, as evaluated by both qualitative and quantitative metrics. An in-depth functional analysis of a generated genome vector shows that its encoded pathways are interconnected, near complete, and ecologically cohesive. On the test set, where the model's ability to reconstruct the uncorrupted genome vectors from corrupted inputs was evaluated, Area Under the Receiver Operating Characteristic Curve (AUROC) and F1 scores of 0.98 and 0.83, respectively, support the model's strong performance. This article showcases the power of ML approaches for synthetic biology and highlights the possibility that artificial intelligence agents may one day be able to design genomes that animate carbon-based cells.
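To make the genome vector and corruption step concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `corrupt` and `f1_score`, the `keep_fraction` parameter, and the ten-gene toy vector are all assumptions introduced for illustration. The sketch shows only how a binary genome vector might be masked and how a reconstruction could be scored with F1. It does not train or include the autoencoder itself.

```python
import random


def corrupt(genome_vector, keep_fraction=0.1, seed=0):
    """Mask most encoded genes, keeping a random fraction of the 1s.

    keep_fraction is a hypothetical parameter chosen for this sketch.
    """
    rng = random.Random(seed)
    present = [i for i, bit in enumerate(genome_vector) if bit == 1]
    if not present:
        return list(genome_vector)
    n_keep = max(1, round(len(present) * keep_fraction))
    kept = set(rng.sample(present, n_keep))
    return [1 if i in kept else 0 for i in range(len(genome_vector))]


def f1_score(original, reconstructed):
    """F1 score of a binary reconstruction against the original vector."""
    pairs = list(zip(original, reconstructed))
    tp = sum(1 for o, r in pairs if o == 1 and r == 1)
    fp = sum(1 for o, r in pairs if o == 0 and r == 1)
    fn = sum(1 for o, r in pairs if o == 1 and r == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy genome vector: position i is 1 if (hypothetical) gene i is encoded.
genome = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
corrupted = corrupt(genome, keep_fraction=0.3)

# Scoring the corrupted input itself gives a baseline that a trained
# model's reconstruction would be expected to improve on.
baseline_f1 = f1_score(genome, corrupted)
```

In this sketch, corruption only removes genes and never adds spurious ones, which matches the masking described above. A trained model's output would replace `corrupted` in the call to `f1_score`.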