Fast activation maximization for molecular sequence design

Johannes Linder, Georg Seelig

doi:10.1186/s12859-021-04437-5

Open Access

https://doi.org/10.1186/s12859-021-04437-5

Journal: BMC Bioinformatics
Publication Date: Oct 20, 2021
Citations: 16
License type: open-access

Affiliation: University of Washington, Seattle University

Abstract

Background: Optimization of DNA and protein sequences based on machine learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence.

Results: Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp’s capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor.

Conclusions: Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
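The design loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear `toy_oracle` is a hypothetical stand-in for a trained predictor, the normalization is assumed to standardize each nucleotide channel across sequence positions, the discrete sequence is taken by argmax rather than sampling, and all function names, the learning rate and the step count are illustrative choices. Gradients are written out by hand so that the straight-through step is visible.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax over the nucleotide channels of each position.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def normalize(logits, eps=1e-5):
    # Assumed form: standardize each nucleotide channel across positions.
    mu = logits.mean(axis=0, keepdims=True)
    sd = np.sqrt(logits.var(axis=0, keepdims=True) + eps)
    return (logits - mu) / sd, sd

def toy_oracle(onehot, weights):
    # Hypothetical predictor: a linear, motif-like score of the one-hot sequence.
    return float((onehot * weights).sum())

def toy_oracle_grad(onehot, weights):
    # Gradient of the linear score with respect to its input. It is constant
    # here; a neural predictor's gradient would depend on the one-hot input.
    return weights

def design(weights, steps=200, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    length, channels = weights.shape
    logits = rng.normal(size=(length, channels))
    best_seq, best_score, first_score = None, -np.inf, None
    for _ in range(steps):
        z, sd = normalize(logits)
        p = softmax(z)
        seq = p.argmax(axis=1)
        onehot = np.eye(channels)[seq]        # forward pass uses a discrete sequence
        score = toy_oracle(onehot, weights)
        if first_score is None:
            first_score = score
        if score > best_score:                # keep the best design seen so far
            best_seq, best_score = seq.copy(), score
        # Straight-through step: the oracle gradient at the one-hot input is
        # passed back as if the one-hot output had been the softmax output.
        g_p = toy_oracle_grad(onehot, weights)
        g_z = p * (g_p - (g_p * p).sum(axis=1, keepdims=True))      # softmax backward
        g_l = (g_z - g_z.mean(axis=0, keepdims=True)
               - z * (g_z * z).mean(axis=0, keepdims=True)) / sd    # normalization backward
        logits = logits + lr * g_l            # gradient ascent on the raw logits
    return best_seq, best_score, first_score

# Toy target: reward nucleotide index (0, 1, 2, 3, 0, 1, 2, 3) at positions 0..7.
W = np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3]]
seq, best, first = design(W)
print("designed sequence:", seq.tolist(), "score:", best, "initial score:", first)
```

Because the normalized logits stay on a fixed scale, the softmax in this sketch cannot saturate, which is the property the abstract credits for avoiding vanishing gradients. Standardization without any learned scale or offset also forces every channel to have zero mean across positions, so this simplified variant cannot represent a sequence dominated by one nucleotide.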

Highlights

Optimization of deoxyribonucleic acid (DNA) and protein sequences based on machine learning models is becoming a powerful tool for molecular design.
We examined whether methods based on direct optimization in general reach higher fitness scores than conditioning of generative models when there is a low degree of epistemic uncertainty.
Inspired by instance normalization in image generative adversarial networks (GANs) [51], we hypothesized that the main bottleneck in earlier design methods, in terms of both optimization speed and minima found, stems from overly large and disproportionately scaled nucleotide logits.
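The hypothesis about large, unevenly scaled logits can be checked numerically on synthetic values. The sketch below is an illustration under stated assumptions, not an analysis from the paper: the per-channel scales are hypothetical, the logits are random, and `standardize_channels` assumes normalization of each nucleotide channel across positions. It compares the average size of the softmax Jacobian, which controls how much gradient reaches the logits, before and after standardization.

```python
import numpy as np

def softmax_rows(z):
    # Row-wise softmax over the nucleotide channels of each position.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_jacobian_norm(logits):
    # Frobenius norm of d softmax / d logits at each position, averaged over positions.
    p = softmax_rows(logits)
    jac = p[:, :, None] * np.eye(p.shape[1])[None] - p[:, :, None] * p[:, None, :]
    return float(np.linalg.norm(jac, axis=(1, 2)).mean())

def standardize_channels(logits, eps=1e-5):
    # Assumed normalization: zero mean and unit variance per channel, across positions.
    mu = logits.mean(axis=0, keepdims=True)
    sd = np.sqrt(logits.var(axis=0, keepdims=True) + eps)
    return (logits - mu) / sd

rng = np.random.default_rng(0)
channel_scales = np.array([100.0, 10.0, 1.0, 50.0])   # hypothetical skewed scales
skewed = rng.normal(size=(50, 4)) * channel_scales     # 50 positions, 4 nucleotides

before = mean_jacobian_norm(skewed)
after = mean_jacobian_norm(standardize_channels(skewed))
print(f"mean Jacobian norm, skewed logits:       {before:.4f}")
print(f"mean Jacobian norm, standardized logits: {after:.4f}")
```

With skewed logits most positions are dominated by a single channel, so the softmax is nearly one-hot and its Jacobian is close to zero. Standardization brings the channels back to a comparable scale and restores a usable gradient signal.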

Summary

Introduction

Optimization of DNA and protein sequences based on machine learning models is becoming a powerful tool for molecular design. Rational design of DNA, RNA and protein sequences has enabled the rapid development of a wide range of biomolecules, including functional or stably folded proteins [1,2,3], optimized promoter sequences [4], active enzymes [5] and de novo antibody components [6, 7]. These design principles are starting to be applied to specific therapeutic domains, for example AAV gene therapy [8], antimicrobial peptides [9] and vaccines [10, 11]. Design methods based on generative models first require selecting an appropriate generative network and tuning several hyper-parameters.

Methods

Results

Discussion

Conclusion