Genotype sampling for deep-learning assisted experimental mapping of a combinatorially complete fitness landscape.

Andreas Wagner

doi:10.1093/bioinformatics/btae317

Abstract

Experimental characterization of fitness landscapes, which map genotypes onto fitness, is important for both evolutionary biology and protein engineering. It faces a fundamental obstacle in the astronomical number of genotypes whose fitness needs to be measured for any one protein. Deep learning may help to predict the fitness of many genotypes from a smaller neural network training sample of genotypes with experimentally measured fitness. Here I use a recently published experimentally mapped fitness landscape of more than 260,000 protein genotypes to ask how such sampling is best performed. I show that multilayer perceptrons, recurrent neural networks (RNNs), convolutional networks, and transformers, can explain more than 90 percent of fitness variance in the data. In addition, 90 percent of this performance is reached with a training sample comprising merely ≈103 sequences. Generalization to unseen test data is best when training data is sampled randomly and uniformly, or sampled to minimize the number of synonymous sequences. In contrast, sampling to maximize sequence diversity or codon usage bias reduces performance substantially. These observations hold for more than one network architecture. Simple sampling strategies may perform best when training deep learning neural networks to map fitness landscapes from experimental data. The fitness landscape data analyzed here is publicly available as described previously (Papkou, et al., 2023). All code used to analyze this landscape is publicly available at https://github.com/andreas-wagner-uzh/fitness_landscape_sampling. Supplementary data are available at Bioinformatics online.

Full Text