Abstract

Several computational methods have been proposed for the identification of essential genes (EGs). Machine-learning-based methods use features derived from genetic sequences, gene-expression data, network topology, homology, and domain information. Except for the sequence-based features, these require additional experimental data, which are unavailable for under-studied and newly sequenced organisms. We therefore propose a purely sequence-based method for identifying EGs. We performed gene essentiality predictions for 15 bacteria, 1 archaeon, and 4 eukaryotes. Information-theoretic quantities, such as mutual information, conditional mutual information, entropy, and Kullback-Leibler divergence, together with Markov models, were used as features. In addition, to improve prediction performance, other easily accessible sequence-based features related to stop codon usage, length, and GC content were included. The Random Forest algorithm was used for classification. The performance of the proposed method was evaluated extensively using both intra- and cross-organism predictions. The results were better than those of most previously published EG predictors that rely only on sequence information, and comparable to those of predictors that use additional features derived from network topology, homology, and gene-expression data.
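
To make the kind of sequence-derived feature described above concrete, the following is a minimal Python sketch that computes a few such quantities from a single gene sequence. It is an illustration only, not the authors' implementation: the abstract does not give the exact feature definitions, so the function names, the use of single-nucleotide entropy, the restriction of mutual information to adjacent nucleotides, and the in-frame stop codon count are all assumptions made here. The sequence is assumed to be a non-empty DNA string of length at least 2.

```python
import math
from collections import Counter

# Hypothetical helper names; the abstract does not define the exact features.
STOP_CODONS = ("TAA", "TAG", "TGA")

def shannon_entropy(seq):
    """Shannon entropy (bits) of the single-nucleotide distribution."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(1 for base in seq if base in "GC") / len(seq)

def adjacent_mutual_information(seq):
    """Mutual information (bits) between each nucleotide and its successor."""
    n = len(seq) - 1                      # number of adjacent pairs
    pairs = Counter(zip(seq, seq[1:]))    # joint counts
    left = Counter(seq[:-1])              # marginal counts, first position
    right = Counter(seq[1:])              # marginal counts, second position
    mi = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

def stop_codon_counts(seq):
    """Count in-frame occurrences of each stop codon (frame 0 assumed)."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return {stop: codons.count(stop) for stop in STOP_CODONS}

def gene_features(seq):
    """Collect the illustrative features for one gene into a dict."""
    seq = seq.upper()
    features = {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "entropy": shannon_entropy(seq),
        "adjacent_mi": adjacent_mutual_information(seq),
    }
    features.update(stop_codon_counts(seq))
    return features
```

Feature dictionaries of this kind, one per gene, could then be stacked into a matrix and passed to a Random Forest classifier from any standard machine-learning library; that step is omitted here because the abstract gives no details of the training setup.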
