EssSubgraph improves performance and generalizability of mammalian essential gene prediction with large networks.
Predicting essential genes is important for understanding the minimal genetic requirements of organisms, identifying disease-associated genes, and discovering potential drug targets. Wet-lab experiments for identifying essential genes are time-consuming and labor-intensive. Although various machine learning methods have been developed for essential gene prediction, both systematic testing with large collections of gene knockout data and rigorous benchmarking for efficient methods are very limited to date. Furthermore, current graph-based approaches require learning the entire gene interaction networks, leading to high computational costs, especially for large-scale networks. To address these issues, we propose EssSubgraph, an inductive representation learning method that integrates graph-structured network data with omics features for training graph neural networks. We used comprehensive lists of human essential genes distilled from the latest collection of knockout datasets for benchmarking. When applied to essential gene prediction with multiple types of biological networks, EssSubgraph achieved superior performance compared to existing graph-based and other models. The performance is more stable than other methods with respect to network structure and gene feature perturbations. Because of its inductive nature, EssSubgraph also enables predicting gene functions using dynamical networks with unseen nodes and it is scalable with respect to network sizes. Finally, EssSubgraph has better performance in cross-species essential gene prediction compared to other methods. Our results show that EssSubgraph effectively combines networks and omics data for accurate essential gene identification while maintaining computational efficiency. The source code and datasets used in this study are freely available at https://github.com/wenmm/EssSubgraph.
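The abstract describes an inductive method that avoids learning the entire interaction network and can handle unseen nodes. A minimal sketch of one common inductive scheme, neighbor sampling followed by mean aggregation, is shown below. The abstract does not specify EssSubgraph's architecture, so this scheme, the gene names, the feature values, and the helper functions `sample_neighbors` and `aggregate` are all illustrative assumptions rather than the authors' implementation.

```python
import random

def sample_neighbors(graph, node, k, rng):
    """Return up to k neighbors of `node`, sampled without replacement."""
    neighbors = sorted(graph.get(node, ()))
    if len(neighbors) <= k:
        return neighbors
    return rng.sample(neighbors, k)

def aggregate(graph, features, node, k, rng):
    """Concatenate a node's own features with the mean of its sampled neighbors' features."""
    own = features[node]
    sampled = sample_neighbors(graph, node, k, rng)
    if not sampled:
        mean = [0.0] * len(own)
    else:
        mean = [
            sum(features[n][i] for n in sampled) / len(sampled)
            for i in range(len(own))
        ]
    return own + mean

# Hypothetical toy network: adjacency sets keyed by gene name.
graph = {"geneA": {"geneB", "geneC"}, "geneB": {"geneA"}, "geneC": {"geneA"}}
# Hypothetical two-dimensional omics features per gene.
features = {"geneA": [1.0, 0.0], "geneB": [0.0, 2.0], "geneC": [4.0, 2.0]}

rng = random.Random(0)
embedding_a = aggregate(graph, features, "geneA", k=2, rng=rng)

# A node added after "training" only needs its local neighborhood,
# which is what makes this kind of aggregation inductive.
graph["geneD"] = {"geneB"}
graph["geneB"].add("geneD")
features["geneD"] = [3.0, 3.0]
embedding_d = aggregate(graph, features, "geneD", k=2, rng=rng)
```

Because each embedding depends only on a bounded sample of neighbors, the cost per node stays fixed as the network grows, which is the property the abstract refers to as scalability with respect to network size.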
- Research Article
- 10.1101/2025.07.21.665218
- Jul 25, 2025
- bioRxiv : the preprint server for biology
- Research Article
10
- 10.1371/journal.pone.0242943
- Nov 30, 2020
- PloS one
Essential gene prediction helps to find the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is to develop a new ML pipeline to help annotate the essential genes of less explored disease-causing organisms for which minimal experimental data are available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm employing a Laplacian Support Vector Machine (LapSVM) to predict essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to the area under the ROC curve (auROC), has been proposed for selecting the best model when supervised performance metrics are difficult to calculate due to lack of data. The unsupervised feature selection followed by dimension reduction revealed a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both Eukaryotes and Prokaryotes, where it shows high accuracy even when the labeled dataset is very limited, this strategy is used to predict essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that shows universality in application to both Prokaryotes and Eukaryotes with limited labeled data.
The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
- Book Chapter
- 10.1007/978-981-16-6554-7_54
- Nov 12, 2021
Essential genes are indispensable for biological survival, so identifying and studying them is of great significance. A machine learning method, K-Nearest Neighbor, is used to develop a predictor of essential bacterial genes. Homology features of the bacterial genomes, including sequence homology and functional homology, are extracted for determining essential genes. Based on these features, we use the K-Nearest Neighbor algorithm to determine gene function. We tune the minimum matching parameter (K) in the essential gene prediction model to build an optimal Escherichia coli-specific model. The corresponding optimal parameter (K) is then extended to essential gene prediction models for other bacteria. After cross-validation, the highest accuracy is 0.89, obtained when K is between 5 and 7. Therefore, the features we extracted can increase the accuracy of bacterial essential gene prediction. On this basis, we found that the prediction accuracy of the K-Nearest Neighbor model did not differ significantly across different evolutionary distances between the organisms in the database and the investigated species. This means the machine learning model can be extended to more distant species, where it will have better predictive performance for essential genes than the usual sequence-based methods.
Keywords: Essential genes, Machine learning, KNN
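The tuning of K by cross-validation described above can be sketched in a few lines. This is a toy illustration only: the two features stand in for hypothetical sequence-homology and functional-homology scores, the data points are invented, and leave-one-out cross-validation is assumed because the abstract does not state which scheme was used.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    ranked = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out cross-validated accuracy for a given k."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if knn_predict(train, x, k) == y:
            correct += 1
    return correct / len(data)

# Invented data: (hypothetical homology scores, label), 1 = essential, 0 = non-essential.
data = [
    ((0.90, 0.80), 1), ((0.80, 0.90), 1), ((0.85, 0.75), 1),
    ((0.95, 0.90), 1), ((0.70, 0.80), 1),
    ((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.25), 0),
    ((0.05, 0.10), 0), ((0.30, 0.20), 0),
]

# Odd values of k avoid tied votes in a two-class problem.
candidates = [1, 3, 5, 7, 9]
accuracy_by_k = {k: loo_accuracy(data, k) for k in candidates}
best_k = max(candidates, key=lambda k: accuracy_by_k[k])
```

On this toy set, small and moderate K classify every held-out point correctly, while K = 9 forces each vote to include more points from the opposite class than from the point's own class, so accuracy collapses. That is the kind of trade-off a cross-validated sweep over K is meant to expose.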
- Research Article
26
- 10.3390/genes10010031
- Jan 8, 2019
- Genes
Essential genes play an indispensable role in supporting the life of an organism. Identification of essential genes helps us to understand the underlying mechanism of cell life. The essential genes of bacteria are potential drug targets for some diseases. Recently, several computational methods have been proposed to detect essential genes based on static protein–protein interaction (PPI) networks. However, these methods ignore the fact that essential genes play essential roles only under certain conditions. In this work, a novel method, called FDP, was proposed for the identification of essential proteins by fusing the dynamic PPI networks of different time points. First, the active PPI network of each time point was constructed, and these networks were then fused into a final network according to their similarities. Finally, a novel centrality method was designed to assign each gene in the final network a ranking score that considers its orthologous property and its global and local topological properties in the network. This model was applied to two different yeast data sets. The results showed that FDP achieved better performance in essential gene prediction than other existing methods based on either static PPI networks or dynamic networks.
- Book Chapter
- 10.1007/978-3-642-21260-4_9
- Jan 1, 2011
Essential genes are indispensable for an organism’s survival. These genes are widely discussed, and many researchers have proposed prediction methods that not only find essential genes but also assist pathogen discovery and drug development. However, few studies have used the relationship between gene functions and essential genes for essential gene prediction. In this paper, we explore essential gene prediction by adopting the association rule mining technique with Gene Ontology semantic analysis. First, we propose two features named GOARC (Gene Ontology Association Rule Confidence) and GOCBA (Gene Ontology Classification Based on Association), which are used to enhance the classifier constructed with the features commonly used in previous studies. Second, we use an association-based classification algorithm without rule pruning for predicting essential genes. Experimental evaluations and semantic analysis show that our methods not only enhance the accuracy of essential gene prediction but also facilitate the understanding of the essential genes’ semantics in gene functions.
Keywords: Data Mining, Gene Ontology, Essential Gene, Association Rule Mining
- Research Article
48
- 10.1371/journal.pcbi.1008229
- Sep 16, 2020
- PLOS Computational Biology
Accurately predicting essential genes using computational methods can greatly reduce the time and resources needed to find them via wet-lab experiments, and further accelerate the process of drug discovery. Several computational methods have been proposed for predicting essential genes in model organisms by integrating multiple biological data sources via either centrality measures or machine learning based methods. However, methods aiming to predict human essential genes are still limited and their performance still needs improvement. In addition, most machine learning based essential gene prediction methods lack the ability to handle the imbalanced learning issue inherent in the essential gene prediction problem, which might be one factor affecting their performance. We propose a deep learning based method, DeepHE, to predict human essential genes by integrating features derived from sequence data and the protein-protein interaction (PPI) network. A deep learning based network embedding method is used to automatically learn features from the PPI network. In addition, 89 sequence features were derived from the DNA sequence and protein sequence of each gene. These two types of features are integrated to train a multilayer neural network. A cost-sensitive technique is used to address the imbalanced learning problem when training the deep neural network. The experimental results for predicting human essential genes show that our proposed method, DeepHE, can accurately predict human gene essentiality with an average AUC higher than 94%, an area under the precision-recall curve (AP) higher than 90%, and an accuracy higher than 90%. We also compare DeepHE with several widely used traditional machine learning models (SVM, Naïve Bayes, Random Forest, and Adaboost) using the same features and the same cost-sensitive technique to address the imbalanced learning issue.
The experimental results show that DeepHE significantly outperforms the compared machine learning models. We have demonstrated that human essential genes can be accurately predicted by designing an effective machine learning algorithm and integrating representative features captured from available biological data. The proposed deep learning framework is effective for this task.
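To make the cost-sensitive idea concrete, the sketch below weights each class inversely to its frequency and applies those weights inside a binary cross-entropy loss. The abstract does not state which weighting scheme DeepHE uses, so the "balanced" heuristic here, the label vector, and the function names `class_weights` and `weighted_bce` are assumptions chosen for illustration.

```python
import math

def class_weights(labels):
    """Weight each class as n / (n_classes * class_count), a common 'balanced' heuristic."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * count) for c, count in counts.items()}

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each example's loss."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)

# Invented imbalanced labels: 2 essential genes (1) among 10.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
weights = class_weights(labels)

# An equally confident mistake costs more on the rare (essential) class.
missed_essential = weighted_bce([1], [0.1], weights)
false_alarm = weighted_bce([0], [0.9], weights)
```

With two positives in ten examples, the essential class receives a weight of 2.5 and the non-essential class 0.625, so a missed essential gene is penalized four times as heavily as an equally confident false alarm.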
- Research Article
3
- 10.1016/j.mimet.2021.106297
- Jul 31, 2021
- Journal of Microbiological Methods
Predicting essential genes of 37 prokaryotes by combining information-theoretic features
- Research Article
38
- 10.1016/j.csbj.2019.05.008
- Jan 1, 2019
- Computational and Structural Biotechnology Journal
An Evaluation of Machine Learning Approaches for the Prediction of Essential Genes in Eukaryotes Using Protein Sequence-Derived Features
- Supplementary Content
13
- 10.1186/1471-2334-13-227
- May 20, 2013
- BMC Infectious Diseases
Background: Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (M. tuberculosis). The annotation of the functional genome and signaling network of M. tuberculosis is still not systematic. Essential gene modules are collections of functionally related essential genes in the same signaling or metabolic pathway. Determining essential genes and essential gene modules at the genomic level may be important for better understanding the physiology and pathology of M. tuberculosis, and may also help the development of drugs against this pathogen. The establishment of the genomic operon database (DOOR) and the annotation of gene pathways have facilitated the genomic analysis of the essential gene modules of M. tuberculosis.
Method: A bibliometric approach was used to perform a high-throughput screen for essential genes of M. tuberculosis strain H37Rv. An ant colony algorithm was used to identify the essential genes in other M. tuberculosis reference strains. Essential gene modules were analyzed with the operon database DOOR. The pathways of essential genes were assessed with Biocarta, KEGG, NCI-PID, HumanCyc and Reactome. The function prediction of essential genes was analyzed with Pfam.
Results: A total of approximately 700 essential genes were identified in the M. tuberculosis genome. 40% of operons consist of two or more essential genes. The essential genes were distributed across 92 pathways in M. tuberculosis. In function prediction, 61.79% of essential genes were categorized into virulence, intermediary metabolism/respiration, cell wall related and lipid metabolism, which are fundamental functions that exist in most bacterial species.
Conclusion: We have identified the essential genes of M. tuberculosis using a bibliometric approach at the genomic level. The essential gene modules were further identified and analyzed.
- Book Chapter
- 10.1007/978-3-030-26969-2_51
- Jan 1, 2019
Prediction of differential essential genes is important for research on cell development and differentiation, drug discovery and disease causes. The goal of this work is to extract gene expression and topological changes in biomolecular networks for identifying essential nodes or modules. Based on the random forests model, this paper proposes an essential node prediction algorithm for biomolecular networks called Differential Network Analysis method based on Random Forests (DNARF). The algorithm has two main points. First, a five-dimension eigenvector construction method is put forward to extract the differential information of nodes in networks. Second, a positive sample expansion method based on the Pearson correlation coefficient is presented to address the possible imbalance between positive and negative samples. In the simulated data experiments, the DNARF algorithm was compared with three other algorithms. The results showed that DNARF performed excellently in the prediction of essential genes. In the real data experiments, four gene regulatory networks were used as datasets. The DNARF algorithm predicted five essential genes related to leukemia: HES1, STAT1, TAL1, SPI1 and RFXANK, all of which are supported by the literature. DNARF could also be applied to other biological networks to identify new essential genes.
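One way a Pearson-based positive sample expansion could work is to promote unlabeled genes whose profiles correlate strongly with a known positive. The abstract gives no procedural detail, so the rule below, the threshold, the placeholder gene names (`g1` to `g4`), and the profile values are all hypothetical and should not be read as the DNARF procedure itself.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant profile; treat as uncorrelated
    return cov / (sd_x * sd_y)

def expand_positives(profiles, positives, threshold):
    """Add any gene whose profile correlates with a known positive at or above `threshold`."""
    expanded = set(positives)
    for gene, profile in profiles.items():
        if gene in positives:
            continue
        if any(pearson(profile, profiles[p]) >= threshold for p in positives):
            expanded.add(gene)
    return expanded

# Invented expression profiles across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],   # known positive
    "g2": [2.0, 4.0, 6.0, 8.5],   # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with g1
    "g4": [1.0, 3.0, 2.0, 4.0],   # only moderately correlated (r = 0.8)
}
expanded = expand_positives(profiles, {"g1"}, threshold=0.95)
```

Only `g2` clears the 0.95 threshold here, so the positive set grows from one gene to two while the weakly and negatively correlated genes stay unlabeled.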
- Research Article
23
- 10.1016/j.gene.2014.08.046
- Aug 27, 2014
- Gene
Analysis and identification of essential genes in humans using topological properties and biological information
- Research Article
2
- 10.1088/1755-1315/655/1/012019
- Feb 1, 2021
- IOP Conference Series: Earth and Environmental Science
Essential genes are the subset of genes required by an organism for growth and sustenance of life, and they are responsible for phenotypic changes when their activities are altered. They have been used as drug targets and disease control agents, among other applications. Essential genes have been widely identified, especially in microorganisms, owing to extensive experimental studies on some of them, such as Escherichia coli and Saccharomyces cerevisiae. The experimental approach has been a reliable method to identify essential genes. However, it is complex, costly, and labour- and time-intensive. Therefore, computational approaches have been developed to complement the experimental approach in order to minimize the resources required for essentiality identification experiments. Machine learning approaches have been widely used to predict essential genes in model organisms using different categories of features with varying degrees of accuracy and performance. However, previous studies have not established the most important categories of features that provide the distinguishing power in machine learning essentiality predictions. Therefore, this study evaluates the discriminating strength of the major categories of features used in the essential gene prediction task, as well as the factors responsible for effective computational prediction. Four categories of features were considered, and k-fold cross-validation was used to build the classification model. Our results show that ontology features, with an AUROC score of 0.936, have the most discriminating power to classify essential and non-essential genes. This study concludes that more ontology-related features will further improve the performance of the machine learning approach, and that sensitivity, precision and AUPRC are realistic measures of performance in essentiality prediction.
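For readers unfamiliar with the evaluation setup mentioned above, the sketch below shows a plain k-fold index split and an AUROC computed from its rank interpretation: the probability that a randomly chosen essential gene is scored above a randomly chosen non-essential one. The labels and scores are invented, and the function names `k_fold_splits` and `auroc` are illustrative rather than taken from the study.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k folds over n examples (round-robin assignment)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels (1 = essential) and classifier scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
score = auroc(labels, scores)

splits = list(k_fold_splits(5, 2))
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; a sort-based rank computation gives the same value more efficiently on large gene sets.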
- Research Article
41
- 10.1186/s12918-014-0117-z
- Oct 15, 2014
- BMC Systems Biology
Background: Clostridium difficile is the leading cause of hospital-borne infections occurring when the natural intestinal flora is depleted following antibiotic treatment. Current treatments for Clostridium difficile infections present high relapse rates, and new hyper-virulent and multi-resistant strains are emerging, making the study of this nosocomial pathogen necessary to find novel therapeutic targets.
Results: We present iMLTC806cdf, an extensively curated reconstructed metabolic network for the C. difficile pathogenic strain 630. iMLTC806cdf contains 806 genes, 703 metabolites and 769 metabolic, 117 exchange and 145 transport reactions. iMLTC806cdf is the most complete and accurate metabolic reconstruction of a gram-positive anaerobic bacterium to date. We validate the model with simulated growth assays in different media and carbon sources and use it to predict essential genes. We obtain 89.2% accuracy in the prediction of gene essentiality when compared to experimental data for B. subtilis homologs (the closest organism for which such data exist). We predict the existence of 76 essential genes and 39 essential gene pairs, a number of which are unique to C. difficile and have non-existing or predicted non-essential human homologs. For 29 of these potential therapeutic targets, we find 125 inhibitors of homologous proteins, including approved drugs with the potential for drug repositioning, that when validated experimentally could serve as starting points in the development of new antibiotics.
Conclusions: We created a highly curated metabolic network model of C. difficile strain 630 and used it to predict essential genes as potential new therapeutic targets in the fight against Clostridium difficile infections.
- Research Article
48
- 10.3389/fmicb.2017.02331
- Nov 27, 2017
- Frontiers in Microbiology
Genes critical for the survival or reproduction of an organism in certain circumstances are classified as essential genes. Essential genes play a significant role in deciphering the survival mechanism of life, and they have broad applications in pharmaceutics and synthetic biology. The continuous progress of experimental methods for essential gene identification has accelerated the accumulation of gene essentiality data, which facilitates the study of essential genes in silico. In this article, we present some available online resources related to gene essentiality, including bioinformatic software tools for transposon sequencing (Tn-seq) analysis, essential gene databases and online services to predict bacterial essential genes. We review several computational approaches that have been used to predict essential genes, and summarize the features used for gene essentiality prediction. In addition, we evaluate the available online bacterial essential gene prediction servers based on the experimentally validated essential gene sets of 30 bacteria from DEG. This article is intended to be a quick reference guide for microbiologists interested in essential genes.