Over just the last 2 years, mRNA therapeutics and vaccines have undergone a rapid transition from an intriguing concept to real-world impact. However, whereas some aspects of mRNA therapeutics, such as the use of chemical modifications to increase stability and reduce immunogenicity, have been extensively optimized for over two decades, other aspects, particularly the selection and design of the noncoding leader and trailer sequences which control translation efficiency and stability, have received comparably less attention. In practice, such 5' and 3' untranslated regions (UTRs) are often borrowed from highly expressed human genes with few or no modifications, as in the case for the Pfizer/BioNTech Covid vaccine. Focusing on the 5'UTR, we here argue that model-driven design is a promising alternative that provides unprecedented control over 5'UTR function. We review recent work that combines synthetic biology with machine learning to build quantitative models that relate ribosome loading, and thus translation efficiency, to the 5'UTR sequence. We first introduce an experimental approach that uses polysome profiling and high-throughput sequencing to quantify ribosome loading for hundreds of thousands of 5'UTRs in parallel. We apply this approach to measure ribosome loading in synthetic RNA libraries with a random sequence inserted into the 5'UTR. We then review Optimus 5-Prime, a convolutional neural network model trained on the experimental data. We highlight that very accurate models of biological regulation can be learned from synthetic data sets with degenerate 5'UTRs. We validate model predictions not only on held-out data sets from our random library but also on a large library of over 30 000 human 5'UTR fragments and using translation reporter data collected independently by other groups. Both the experiment and model are compatible with commonly used chemically modified nucleosides, in particular, pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ). We find that, in general, 5'UTRs have very similar impacts when combined with different protein-coding sequences and even in the context of different chemical modifications. We demonstrate that Optimus 5-Prime can be combined with design algorithms to generate de novo sequences with precisely defined translation efficiencies. We emphasize recent developments in design algorithms that rely on activation maximization and generative modeling to improve both the fitness and diversity of designed sequences. Compared with prior approaches such as genetic algorithms, we show that these approaches are not only faster but also less likely to get stuck in local sequence optima. Finally, we discuss how the approach reviewed here can be generalized to other gene regions and applications.
Read full abstract