Abstract
Despite the growing volume of experimentally validated knowledge about the subcellular localization of plant proteins, a well-performing in silico prediction tool is still a necessity. Existing tools, which employ information derived from protein sequence alone, offer limited accuracy and/or rely on full sequence availability. We explored whether gene expression profiling data can be harnessed to enhance prediction performance. To achieve this, we trained several support vector machines to predict the subcellular localization of Arabidopsis thaliana proteins using sequence-derived information, expression behavior, or a combination of these data, and compared their predictive performance through a cross-validation test. We show that gene expression carries information about subcellular localization that is not available in sequence information, yielding dramatic benefits for plastid localization prediction and some notable improvements for other compartments such as the mitochondrion, the Golgi, and the plasma membrane. Based on these results, we constructed a novel subcellular localization prediction engine, SLocX, which combines gene expression profiling data with protein sequence-based information. We then validated the results of this engine using an independent test set of annotated proteins and transient expression of GFP fusion proteins. Here, we present the prediction framework and a website of predicted localizations for Arabidopsis. The relatively good accuracy of our prediction engine, even in cases where only a partial protein sequence is available (e.g., in sequences lacking the N-terminal region), offers a promising opportunity for similar application to non-sequenced or poorly annotated plant species. Although the prediction scope of our method is currently limited by the availability of expression information on the ATH1 array, we believe that advances in gene expression measurement technology will make our method applicable to all Arabidopsis proteins.
Highlights
In eukaryotic cells, the targeting of proteins to subcellular compartments is universally recognized to be important for proper protein function (Eisenhaber and Bork, 1998).
We developed a novel tool to predict the subcellular localization of Arabidopsis proteins integrating protein amino acid composition with expression profiling data.
The final predictor, which was compared with state-of-the-art predictors, was built using the top 1,000 features selected from a mixture of amino acid composition information and expression data.
We found this number of features to be sufficient for ...
Generation of custom vector and protein–GFP fusion constructs: two candidate genes, At1g16000.1 and At5g19540.1, whose subcellular localization had hitherto not been experimentally determined, were randomly selected.
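The highlights mention that the final predictor used the top 1,000 features drawn from a mixture of amino acid composition and expression data. The selection criterion is not stated in this excerpt, so the sketch below is only an illustration under stated assumptions: it ranks feature columns with a simple two-class Fisher-style score (squared difference of class means over summed class variances) and keeps the k best. The function names `fisher_score` and `select_top_k`, the binary one-compartment-versus-rest labels, and the toy data are all hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Score one feature by how well it separates two classes (labels 0/1).

    Assumption: this criterion is a stand-in; the source does not say
    which ranking method was actually used.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    gap = (mean(pos) - mean(neg)) ** 2
    if spread == 0:
        # No within-class variation: perfectly separating or useless.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def select_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring feature columns, best first."""
    n_features = len(rows[0])
    scores = [
        fisher_score([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 4 proteins x 3 features.
# Labels mark one compartment (1) versus everything else (0).
rows = [
    [0.10, 5.0, 1.0],
    [0.12, 1.0, 1.1],
    [0.30, 5.2, 0.9],
    [0.32, 0.8, 1.0],
]
labels = [0, 0, 1, 1]
top = select_top_k(rows, labels, k=2)
print(top)
```

In the setting described above, k would be 1,000 and the columns would hold both composition and expression features. Both classes must contain at least one example, otherwise `statistics.mean` raises an error.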
Summary
The targeting of proteins to subcellular compartments is universally recognized to be important for proper protein function (Eisenhaber and Bork, 1998). A wide variety of N-terminal prediction systems has been developed over the years, but some methods are limited in accuracy and/or in the breadth of coverage of subcellular compartments. These methods fail to make a valid prediction when a protein is targeted to its final compartment through non-classical mechanisms of protein sorting (Herman and Schmidt, 2004; Nickel and Seedorf, 2008; Wienkoop et al., 2010) or contains a non-conventional targeting sequence (Brix et al., 1999; Diekert et al., 1999). We developed a novel tool to predict the subcellular localization of Arabidopsis proteins that integrates protein amino acid composition with expression profiling data.
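To make the idea of integrating amino acid composition with expression profiling data concrete, the following is a minimal sketch of how one combined feature vector per protein might be assembled. It is not the SLocX implementation: the helper names `amino_acid_composition` and `combined_features`, the simple concatenation, and the example peptide and expression values are assumptions made for illustration only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard characters are ignored; an empty sequence gives all zeros.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def combined_features(sequence, expression_profile):
    """Concatenate composition fractions with expression values.

    Assumption: plain concatenation; the source does not specify how the
    two data types were scaled or merged.
    """
    return amino_acid_composition(sequence) + [float(x) for x in expression_profile]


# Hypothetical toy input: a short peptide and three made-up expression values.
features = combined_features("MAAK", [7.2, 5.1, 9.8])
print(len(features))  # 20 composition values plus 3 expression values
```

Because composition is computed over whatever residues are present, a vector of this kind can still be built when only part of the sequence is available, for example when the N-terminal region is missing.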