Training Set Selection for the Prediction of Essential Genes

Jian Cheng, Shiheng Tao, Xiangchen Li, Zhao Xu, Yanlin Liu, Li Zhao, Wenwu Wu, Lars Kaderali

doi:10.1371/journal.pone.0086805

Abstract

Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) the species used as the training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited a remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets selected with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.

Highlights

As a minimal gene subset in organisms, essential genes are required for survival, development and fertility [1,2].
Based on the AUC matrix, the variation of AUC scores from different training sets was displayed as boxplots (Figure 2A), and these variations were used to determine the influence of different training sets on predictive accuracy.
We found that the interquartile ranges (IQRs) of many species within testing sets were >0.03 (Table S2).
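
The IQR summary mentioned above can be illustrated with a minimal Python sketch. It assumes a square AUC matrix in which `auc[i][j]` is the score obtained when training on species `i` and testing on species `j`; the species labels, the toy values, the function names, and the choice to skip self-predictions are all assumptions made for illustration and are not taken from the study.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1), with linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_per_testing_set(auc, species):
    """Return {testing species: IQR of its AUC scores across training species}.

    auc[i][j] is the AUC when training on species[i] and testing on species[j].
    Self-predictions (i == j) are skipped; this is an assumption of the sketch.
    """
    result = {}
    for j, target in enumerate(species):
        column = [auc[i][j] for i in range(len(species)) if i != j]
        result[target] = iqr(column)
    return result


# Toy values for illustration only; these are not data from the study.
species = ["A", "B", "C", "D", "E"]
auc = [
    [1.00, 0.80, 0.70, 0.75, 0.60],
    [0.82, 1.00, 0.72, 0.74, 0.65],
    [0.78, 0.81, 1.00, 0.76, 0.70],
    [0.80, 0.79, 0.71, 1.00, 0.90],
    [0.79, 0.80, 0.73, 0.75, 1.00],
]

spreads = iqr_per_testing_set(auc, species)

# Testing sets whose AUC varies noticeably with the choice of training set,
# using the 0.03 threshold quoted in the text.
sensitive = [name for name, value in spreads.items() if value > 0.03]
```

In this toy matrix only the hypothetical species `E` exceeds the 0.03 threshold, meaning its predicted accuracy depends strongly on which organism supplied the training set.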

Summary

Introduction

As a minimal gene subset in organisms, essential genes are required for survival, development and fertility [1,2]. Identifying such genes can aid in understanding the primary structures of complex gene regulatory networks in a cell [3,4,5], elucidating the relationship between genotype and phenotype [6,7] and discovering potential drug targets in novel pathogens [8,9,10]. Essential genes can also be useful for re-engineering microorganisms [11,12] and for investigating the causes of human diseases [13,14]. Recent developments in bioinformatics have significantly advanced the computational tools and resources available to investigate essential genes.

Methods

Results

Conclusion