Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.

Catherine Ching Han Chang, Jiangning Song, Bengti Tey, Geoffrey I Webb, Ramakrishnan Nagasundara Ramanan, Chen Li

doi:10.1038/srep21844

Abstract

Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for the rational selection of promising candidates can serve as a powerful complement to current trial-and-error approaches, allowing proteomics studies to be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture to predict the real-valued expression level of a target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) model is used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson’s correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to the different models. The results indicate that the occurrence of the dipeptide of glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.
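The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Periscope implementation: the class `TwoStagePredictor`, the function `dipeptide_frequency`, the toy classifier and the toy regressors are all hypothetical stand-ins with arbitrary numbers, used only to show how a first-stage class label selects the second-stage regression model. The residue order "QD" for the glutamine and aspartic acid dipeptide is also an assumption, since the abstract does not state the order.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]


def dipeptide_frequency(sequence: str, dipeptide: str = "QD") -> float:
    """Fraction of overlapping residue pairs in `sequence` equal to `dipeptide`.

    "QD" (glutamine then aspartic acid) is an assumed order for illustration.
    """
    seq = sequence.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        return 0.0
    hits = sum(1 for i in range(n_pairs) if seq[i:i + 2] == dipeptide)
    return hits / n_pairs


class TwoStagePredictor:
    """Stage 1 assigns a class; stage 2 applies the regressor for that class."""

    def __init__(
        self,
        classifier: Callable[[Features], str],
        regressors: Dict[str, Callable[[Features], float]],
    ) -> None:
        self.classifier = classifier
        self.regressors = regressors

    def predict(self, features: Features) -> Tuple[str, float]:
        label = self.classifier(features)          # "high", "medium" or "low"
        estimate = self.regressors[label](features)  # class-specific model
        return label, estimate


# Hypothetical stand-ins: a real system would plug in a trained SVM
# classifier and one trained SVR model per expression class.
def toy_classifier(features: Features) -> str:
    score = features[0]
    if score >= 0.5:
        return "high"
    if score >= 0.2:
        return "medium"
    return "low"


toy_regressors: Dict[str, Callable[[Features], float]] = {
    "high": lambda f: 100.0 + 10.0 * f[0],   # arbitrary illustrative numbers
    "medium": lambda f: 10.0 + f[0],
    "low": lambda f: 0.1 * f[0],
}

predictor = TwoStagePredictor(toy_classifier, toy_regressors)
label, estimate = predictor.predict([0.5])
print(label, estimate)
```

Keeping the classifier and the per-class regressors as interchangeable callables mirrors the dispatch step only; feature extraction, training and model selection are outside the scope of this sketch.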

Highlights

The rate of protein folding influences the amount of protein expressed, and can be estimated from the amino acid sequence[21,22,23]
We designed a predictor with a two-stage architecture that first classifies an input sequence into high, medium or low expression level and subsequently estimates the soluble protein yield in the periplasm of E. coli
Periscope offers an optional output delivery mode in which users can retrieve the prediction output as a text file via email. This additional function allows the user to save the prediction output for interpretation or follow-up analysis
Applying both correlation-based feature selection (CFS) and subset size forward selection as the feature selection approach resulted in a subset of seven features for the primary classification task (Table 3)
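As a rough illustration of correlation-based feature selection, the sketch below computes the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation of a subset of k features, and uses it in a plain greedy forward search. This is an assumption-laden toy: the greedy search is a simple substitute and not the subset size forward selection procedure named above, and the feature names and correlation values are invented for the example.

```python
import math
from typing import Dict, FrozenSet, List, Tuple


def cfs_merit(
    subset: List[str],
    feat_class_corr: Dict[str, float],
    feat_feat_corr: Dict[FrozenSet[str], float],
) -> float:
    """CFS merit: rewards class relevance, penalises redundancy between features."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(abs(feat_class_corr[f]) for f in subset) / k
    if k == 1:
        mean_ff = 0.0
    else:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        mean_ff = sum(abs(feat_feat_corr[frozenset(p)]) for p in pairs) / len(pairs)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


def greedy_forward_selection(
    features: List[str],
    feat_class_corr: Dict[str, float],
    feat_feat_corr: Dict[FrozenSet[str], float],
) -> Tuple[List[str], float]:
    """Add the feature that most improves the merit; stop when none does."""
    selected: List[str] = []
    remaining = list(features)
    best = 0.0
    while remaining:
        merit, feat = max(
            (cfs_merit(selected + [f], feat_class_corr, feat_feat_corr), f)
            for f in remaining
        )
        if merit <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected, best


# Invented correlations: "a" and "b" are both relevant but highly redundant.
class_corr = {"a": 0.8, "b": 0.7, "c": 0.6}
pair_corr = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.0,
    frozenset(("b", "c")): 0.0,
}
selected, merit = greedy_forward_selection(["a", "b", "c"], class_corr, pair_corr)
print(selected, round(merit, 4))
```

In this toy input the search keeps "a" and "c" and skips "b", because "b" adds little beyond what "a" already contributes.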

Summary

Introduction

The rate of protein folding influences the amount of protein expressed, and can be estimated from the amino acid sequence[21,22,23]. A number of computational algorithms and tools have been developed to predict both protein solubility and protein folding rate[21,23,24,25,26], based on the correlations between the amino acid sequence and these two important protein properties. Classifiers are mainly built using SVMs, while most tools for real-valued protein folding rate prediction employ multiple linear regression or SVR. The prediction tools for real-valued protein folding rate achieved correlation coefficients greater than 0.7[28]. Given the amino acid sequence of a signal peptide–target protein combination, Periscope classifies the soluble expression of the target protein into one of three classes (high, medium, or low expression level) and further predicts the quantity of soluble protein in the periplasm of E. coli, in units of mg/l.
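The two kinds of output described here are usually scored with two different measures: classification accuracy for the class labels and Pearson's correlation coefficient (PCC) for the real-valued estimates. The following Python sketch shows both computations using only the standard library. The function names and the example values are illustrative assumptions and are not data from the study.

```python
import math
from typing import Sequence


def accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Fraction of samples whose predicted class matches the true class."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only.
true_classes = ["high", "low", "medium", "low"]
predicted_classes = ["high", "low", "low", "low"]
measured_mg_per_l = [1.0, 2.0, 3.0]
predicted_mg_per_l = [2.0, 4.0, 6.0]

print(accuracy(true_classes, predicted_classes))
print(pearson_correlation(measured_mg_per_l, predicted_mg_per_l))
```

PCC measures how well predicted and measured yields vary together, so a predictor can score a high PCC while still being off by a constant factor, as in the made-up example above.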

Methods

Results

Conclusion