Accuracy and data efficiency in deep learning models of protein expression

Evangelos-Marios Nikolados, Diego A Oyarzún, Arin Wongprommoon, Guillaume Cambray, Oisin Mac Aodha

doi:10.1038/s41467-022-34902-5

Abstract

Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training datasets, which create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency, and we employ explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.
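The accuracy-versus-data-size question raised in the abstract can be illustrated with a learning curve. The following is a minimal sketch, not the authors' pipeline: it assumes one-hot encoded DNA, uses closed-form ridge regression as a simple stand-in for the deep models, and runs on purely synthetic sequences and expression values. All function names, the sequence length, and the training sizes are hypothetical choices made for illustration.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat one-hot vector of length 4 * len(seq)."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    encoded = np.zeros((len(seq), 4))
    encoded[np.arange(len(seq)), idx] = 1.0
    return encoded.ravel()

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression; the last weight is a bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def ridge_predict(X, w):
    """Apply ridge weights (including bias) to a feature matrix."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Fit on the first n training examples for each n and score on the test set."""
    scores = {}
    for n in sizes:
        w = ridge_fit(X_train[:n], y_train[:n])
        scores[n] = r_squared(y_test, ridge_predict(X_test, w))
    return scores

# Synthetic toy data (an assumption for illustration, not data from the paper):
# random 20-nt sequences whose "expression" is a noisy linear function of the bases.
rng = np.random.default_rng(0)
seq_len = 20
seqs = ["".join(rng.choice(list(ALPHABET), size=seq_len)) for _ in range(1200)]
X = np.stack([one_hot(s) for s in seqs])
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + rng.normal(scale=0.5, size=X.shape[0])

# Hold out the last 200 sequences and train on growing subsets of the rest.
scores = learning_curve(X[:1000], y[:1000], X[1000:], y[1000:], sizes=[50, 200, 1000])
for n, r2 in scores.items():
    print(f"n_train={n:5d}  R^2={r2:.3f}")
```

Plotting held-out R^2 against training size for several model types and sequence libraries would give the kind of comparison the abstract describes; swapping the ridge model for a convolutional network would not change the structure of the loop.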

Journal: Nature Communications	Publication Date: Dec 15, 2022
Citations: 31	License type: open-access
