EssSubgraph improves performance and generalizability of mammalian essential gene prediction with large networks.

Abstract

Predicting essential genes is important for understanding the minimal genetic requirements of organisms, identifying disease-associated genes, and discovering potential drug targets. Wet-lab experiments for identifying essential genes are time-consuming and labor-intensive. Although various machine learning methods have been developed for essential gene prediction, both systematic testing with large collections of gene knockout data and rigorous benchmarking for efficient methods are very limited to date. Furthermore, current graph-based approaches require learning the entire gene interaction networks, leading to high computational costs, especially for large-scale networks. To address these issues, we propose EssSubgraph, an inductive representation learning method that integrates graph-structured network data with omics features for training graph neural networks. We used comprehensive lists of human essential genes distilled from the latest collection of knockout datasets for benchmarking. When applied to essential gene prediction with multiple types of biological networks, EssSubgraph achieved superior performance compared to existing graph-based and other models. The performance is more stable than other methods with respect to network structure and gene feature perturbations. Because of its inductive nature, EssSubgraph also enables predicting gene functions using dynamical networks with unseen nodes and it is scalable with respect to network sizes. Finally, EssSubgraph has better performance in cross-species essential gene prediction compared to other methods. Our results show that EssSubgraph effectively combines networks and omics data for accurate essential gene identification while maintaining computational efficiency. The source code and datasets used in this study are freely available at https://github.com/wenmm/EssSubgraph.
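The inductive idea described in the abstract can be illustrated with a toy example. The sketch below is not the EssSubgraph implementation (that lives in the linked repository); it is a minimal, GraphSAGE-style illustration under stated assumptions: each gene's representation is built from its own feature vector plus the mean of a small sample of its neighbors' features, so only a local subgraph is touched and a gene added after the fact can be embedded without reprocessing the whole network. The names `sample_neighbors` and `aggregate`, the toy genes, and the feature values are all hypothetical, and a real graph neural network would add learned weights and nonlinearities.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Return at most k neighbors of `node` (all of them if there are <= k)."""
    nbrs = sorted(adj.get(node, ()))
    return nbrs if len(nbrs) <= k else rng.sample(nbrs, k)


def aggregate(adj, features, node, k=2, seed=0):
    """Concatenate a node's own features with the mean of sampled neighbor features.

    Only the sampled local neighborhood is read, never the full graph.
    """
    rng = random.Random(seed)
    own = features[node]
    nbrs = sample_neighbors(adj, node, k, rng)
    dim = len(own)
    if nbrs:
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
    else:
        mean = [0.0] * dim  # isolated node: no neighborhood signal
    return list(own) + mean


# Toy network: three genes with two hypothetical omics features each.
adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [2.0, 3.0]}

# Inductive step: a previously unseen gene D is attached to A and B.
adj["D"] = {"A", "B"}
adj["A"].add("D")
adj["B"].add("D")
features["D"] = [0.5, 0.5]

embedding_d = aggregate(adj, features, "D", k=2)
print(embedding_d)
```

Because the aggregation rule depends only on features and local structure, the same function applies unchanged to gene D, which is the property that lets an inductive model handle networks with unseen nodes.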

Similar Papers
  • Research Article
  • 10.1101/2025.07.21.665218
EssSubgraph improves performance and generalizability of mammalian essential gene prediction with large networks.
  • Jul 25, 2025
  • bioRxiv : the preprint server for biology
  • Haimei Wen + 5 more

  • Research Article
  • Cite Count: 10
  • 10.1371/journal.pone.0242943
Essential gene prediction using limited gene essentiality information-An integrative semi-supervised machine learning strategy.
  • Nov 30, 2020
  • PloS one
  • Sutanu Nandi + 2 more

Essential gene prediction helps to find minimal genes indispensable for the survival of any organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is the development of a new ML pipeline to help in the annotation of essential genes of less explored disease-causing organisms for which minimal experimental data is available. The proposed strategy combines unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and semi-supervised ML algorithm employing Laplacian Support Vector Machine (LapSVM) for prediction of essential and non-essential genes from genome-scale metabolic networks using very limited labeled dataset. A novel scoring technique, Semi-Supervised Model Selection Score, equivalent to area under the ROC curve (auROC), has been proposed for the selection of the best model when supervised performance metrics calculation is difficult due to lack of data. The unsupervised feature selection followed by dimension reduction helped to observe a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both Eukaryotes and Prokaryotes that show high accuracy even when the labeled dataset is very limited, this strategy is used for the prediction of essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that shows universality in application to both Prokaryotes and Eukaryotes with limited labeled data. 
The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.

  • Components
  • Cite Count: 3
  • 10.1371/journal.pone.0242943.r006
Essential gene prediction using limited gene essentiality information–An integrative semi-supervised machine learning strategy
  • Nov 30, 2020
  • Seyedali Mirjalili + 3 more


  • Book Chapter
  • 10.1007/978-981-16-6554-7_54
The Algorithms of Predicting Bacterial Essential Genes and NcRNAs by Machine Learning
  • Nov 12, 2021
  • Yuannong Ye + 2 more

Essential genes are indispensable for biological survival, so identifying and studying them is of great significance. A machine learning method, K-Nearest Neighbor, is used to predict essential bacterial genes. Homology features of the bacterial genomes, including sequence homology and functional homology, are extracted for determining essential genes. Based on these features, we use the K-Nearest Neighbor algorithm to determine gene function. We tune the minimum matching parameter (K) of the essential gene prediction model to build an optimal Escherichia coli-specific model. The corresponding optimal parameter (K) is then extended to prediction models for other bacterial essential genes. After cross validation, the highest accuracy is 0.89, obtained with K between 5 and 7. Therefore, the features we extracted can increase the accuracy of bacterial essential gene prediction. We also found that the accuracy of the K-Nearest Neighbor prediction model did not differ significantly across evolutionary distances between the organisms in the database and the investigated species. This means the machine learning model can be extended to more distant species, where it will have better predictive performance for essential genes than the usual sequence-based methods.
Keywords: Essential genes, Machine learning, KNN

  • Research Article
  • Cite Count: 26
  • 10.3390/genes10010031
A Novel Method for Identifying Essential Genes by Fusing Dynamic Protein-Protein Interactive Networks.
  • Jan 8, 2019
  • Genes
  • Fengyu Zhang + 4 more

Essential genes play an indispensable role in supporting the life of an organism. Identification of essential genes helps us to understand the underlying mechanism of cell life. The essential genes of bacteria are potential drug targets for some diseases. Recently, several computational methods have been proposed to detect essential genes based on static protein–protein interactive (PPI) networks. However, these methods ignore the fact that essential genes play essential roles only under certain conditions. In this work, a novel method (called FDP) is proposed for the identification of essential proteins by fusing the dynamic PPI networks of different time points. First, the active PPI network of each time point was constructed, and the networks were then fused into a final network according to their similarities. Finally, a novel centrality method was designed to assign each gene in the final network a ranking score that considers its orthologous property and its global and local topological properties in the network. The model was applied to two different yeast data sets. The results showed that FDP achieved better performance in essential gene prediction than existing methods based on either static PPI networks or dynamic networks.

  • Book Chapter
  • 10.1007/978-3-642-21260-4_9
Prediction of Essential Genes by Mining Gene Ontology Semantics
  • Jan 1, 2011
  • Yu-Cheng Liu + 3 more

Essential genes are indispensable for an organism’s survival. These genes are widely discussed, and many researchers have proposed prediction methods that not only find essential genes but also assist pathogen discovery and drug development. However, few studies have used the relationship between gene functions and essential genes for essential gene prediction. In this paper, we explore essential gene prediction by combining association rule mining with Gene Ontology semantic analysis. First, we propose two features named GOARC (Gene Ontology Association Rule Confidence) and GOCBA (Gene Ontology Classification Based on Association), which are used to enhance the classifier constructed with the features commonly used in previous studies. Second, we use an association-based classification algorithm without rule pruning for predicting essential genes. Experimental evaluations and semantic analysis show that our methods not only enhance the accuracy of essential gene prediction but also facilitate the understanding of the essential genes’ semantics in gene functions.
Keywords: Data Mining, Gene Ontology, Essential Gene, Association Rule Mining

  • Research Article
  • Cite Count: 48
  • 10.1371/journal.pcbi.1008229
DeepHE: Accurately predicting human essential genes based on deep learning.
  • Sep 16, 2020
  • PLOS Computational Biology
  • Xue Zhang + 2 more

Accurately predicting essential genes using computational methods can greatly reduce the time and resources spent finding them via wet-lab experiments, and further accelerate the process of drug discovery. Several computational methods have been proposed for predicting essential genes in model organisms by integrating multiple biological data sources, either via centrality measures or machine learning based methods. However, methods aiming to predict human essential genes are still limited and their performance still needs to improve. In addition, most machine learning based essential gene prediction methods cannot handle the imbalanced learning issue inherent in the essential gene prediction problem, which might be one factor affecting their performance. We propose a deep learning based method, DeepHE, to predict human essential genes by integrating features derived from sequence data and the protein-protein interaction (PPI) network. A deep learning based network embedding method is used to automatically learn features from the PPI network. In addition, 89 sequence features were derived from the DNA sequence and protein sequence of each gene. These two types of features are integrated to train a multilayer neural network. A cost-sensitive technique is used to address the imbalanced learning problem when training the deep neural network. The experimental results for predicting human essential genes show that DeepHE can accurately predict human gene essentiality, with an average AUC higher than 94%, an area under the precision-recall curve (AP) higher than 90%, and an accuracy higher than 90%. We also compare DeepHE with several widely used traditional machine learning models (SVM, Naïve Bayes, Random Forest, and Adaboost) using the same features and the same cost-sensitive technique to counter the imbalanced learning issue. The experimental results show that DeepHE significantly outperforms the compared machine learning models. We have demonstrated that human essential genes can be accurately predicted by designing an effective machine learning algorithm and integrating representative features captured from available biological data. The proposed deep learning framework is effective for this task.
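The abstract above does not say which cost-sensitive technique DeepHE uses, so the following is only a generic illustration of the idea, not that paper's method. It is a minimal sketch assuming inverse-frequency class weights applied to a binary cross-entropy loss, so that errors on the rarer class (here, essential genes) cost more. The function names `class_weights` and `weighted_bce` and the toy labels are hypothetical.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights: n / (2 * count) per class.

    Assumes both classes (0 and 1) occur at least once in `labels`.
    """
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    return {1: n / (2 * pos), 0: n / (2 * neg)}


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy with each sample scaled by its class weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)


# Toy imbalanced set: one essential gene (1) among three non-essential (0).
labels = [1, 0, 0, 0]
weights = class_weights(labels)

# An uninformative predictor that outputs 0.5 for every gene.
loss = weighted_bce(labels, [0.5] * 4, weights)
print(weights, loss)
```

With these weights the single positive example contributes as much to the loss as the three negatives combined, which is the rebalancing effect a cost-sensitive scheme aims for.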

  • Research Article
  • Cite Count: 3
  • 10.1016/j.mimet.2021.106297
Predicting essential genes of 37 prokaryotes by combining information-theoretic features
  • Jul 31, 2021
  • Journal of Microbiological Methods
  • Xiao Liu + 4 more


  • Research Article
  • Cite Count: 38
  • 10.1016/j.csbj.2019.05.008
An Evaluation of Machine Learning Approaches for the Prediction of Essential Genes in Eukaryotes Using Protein Sequence-Derived Features
  • Jan 1, 2019
  • Computational and Structural Biotechnology Journal
  • Tulio L Campos + 3 more


  • Supplementary Content
  • Cite Count: 13
  • 10.1186/1471-2334-13-227
High-throughput screen of essential gene modules in Mycobacterium tuberculosis: a bibliometric approach
  • May 20, 2013
  • BMC Infectious Diseases
  • Guangyu Xu + 7 more

Background: Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (M. tuberculosis). The annotation of the functional genome and signaling network of M. tuberculosis is still not systematic. An essential gene module is a collection of functionally related essential genes in the same signaling or metabolic pathway. Determining essential genes and essential gene modules at the genomic level may be important for better understanding the physiology and pathology of M. tuberculosis, and helpful for the development of drugs against this pathogen. The establishment of the genomic operon database (DOOR) and the annotation of gene pathways have facilitated the genomic analysis of the essential gene modules of M. tuberculosis.
Method: A bibliometric approach was used to perform a high-throughput screen for essential genes of M. tuberculosis strain H37Rv. An ant colony algorithm was used to identify the essential genes in other M. tuberculosis reference strains. Essential gene modules were analyzed with the operon database DOOR. The pathways of essential genes were assessed with Biocarta, KEGG, NCI-PID, HumanCyc and Reactome. The function prediction of essential genes was analyzed with Pfam.
Results: A total of approximately 700 essential genes were identified in the M. tuberculosis genome. 40% of operons consist of two or more essential genes. The essential genes were distributed across 92 pathways in M. tuberculosis. In function prediction, 61.79% of essential genes were categorized into virulence, intermediary metabolism/respiration, cell wall related and lipid metabolism, which are fundamental functions that exist in most bacterial species.
Conclusion: We have identified the essential genes of M. tuberculosis using a bibliometric approach at the genomic level. The essential gene modules were further identified and analyzed.

  • Book Chapter
  • 10.1007/978-3-030-26969-2_51
A Novel Differential Essential Genes Prediction Method Based on Random Forests Model
  • Jan 1, 2019
  • Jiang Xie + 5 more

Prediction of differential essential genes is an important field for research on cell development and differentiation, drug discovery and disease causes. The goal of this work is to extract gene expression and topological changes in biomolecular networks for identifying the essential nodes or modules. Based on the random forests model, this paper proposes an essential node prediction algorithm for biomolecular networks called Differential Network Analysis method based on Random Forests (DNARF). The algorithm has two main points. First, a five-dimension eigenvector construction method is put forward to extract the differential information of nodes in networks. Second, a positive sample expansion method based on the Pearson correlation coefficient is presented to address the imbalance between positive and negative samples. In the simulated data experiments, the DNARF algorithm was compared with three other algorithms. The results showed that DNARF performed excellently in the prediction of essential genes. In the real data experiments, four gene regulatory networks were used as datasets. The DNARF algorithm predicted five essential genes related to leukemia: HES1, STAT1, TAL1, SPI1 and RFXANK, which are supported by the literature. DNARF could also be applied to other biological networks to identify new essential genes.

  • Research Article
  • Cite Count: 23
  • 10.1016/j.gene.2014.08.046
Analysis and identification of essential genes in humans using topological properties and biological information
  • Aug 27, 2014
  • Gene
  • Lei Yang + 6 more


  • Research Article
  • Cite Count: 2
  • 10.1088/1755-1315/655/1/012019
Performance evaluation of features for gene essentiality prediction
  • Feb 1, 2021
  • IOP Conference Series: Earth and Environmental Science
  • Olufemi Aromolaran + 2 more

Essential genes are the subset of genes required by an organism for growth and sustenance of life, and are responsible for phenotypic changes when their activities are altered. They have been used as drug targets, disease control agents, etc. Essential genes have been widely identified, especially in microorganisms, owing to extensive experimental studies on some of them such as Escherichia coli and Saccharomyces cerevisiae. The experimental approach is a reliable way to identify essential genes. However, it is complex, costly, and labour- and time-intensive. Therefore, computational approaches have been developed to complement the experimental approach and minimize the resources required for essentiality identification experiments. Machine learning approaches have been widely used to predict essential genes in model organisms using different categories of features, with varying degrees of accuracy and performance. However, previous studies have not established which categories of features provide the most distinguishing power in machine learning essentiality predictions. Therefore, this study evaluates the discriminating strength of the major categories of features used in the essential gene prediction task, as well as the factors responsible for effective computational prediction. Four categories of features were considered, and k-fold cross-validation was used to build the classification model. Our results show that ontology features, with an AUROC score of 0.936, have the most discriminating power to classify essential and non-essential genes. This study concludes that more ontology-related features will further improve the performance of the machine learning approach, and that sensitivity, precision and AUPRC are realistic measures of performance in essentiality prediction.

  • Research Article
  • Cite Count: 41
  • 10.1186/s12918-014-0117-z
A curated C. difficile strain 630 metabolic network: prediction of essential targets and inhibitors.
  • Oct 15, 2014
  • BMC Systems Biology
  • Mathieu Larocque + 2 more

Background: Clostridium difficile is the leading cause of hospital-borne infections occurring when the natural intestinal flora is depleted following antibiotic treatment. Current treatments for Clostridium difficile infections present high relapse rates, and new hyper-virulent and multi-resistant strains are emerging, making the study of this nosocomial pathogen necessary to find novel therapeutic targets.
Results: We present iMLTC806cdf, an extensively curated reconstructed metabolic network for the C. difficile pathogenic strain 630. iMLTC806cdf contains 806 genes, 703 metabolites and 769 metabolic, 117 exchange and 145 transport reactions. iMLTC806cdf is the most complete and accurate metabolic reconstruction of a gram-positive anaerobic bacterium to date. We validate the model with simulated growth assays in different media and carbon sources and use it to predict essential genes. We obtain 89.2% accuracy in the prediction of gene essentiality when compared to experimental data for B. subtilis homologs (the closest organism for which such data exists). We predict the existence of 76 essential genes and 39 essential gene pairs, a number of which are unique to C. difficile and have non-existing or predicted non-essential human homologs. For 29 of these potential therapeutic targets, we find 125 inhibitors of homologous proteins, including approved drugs with the potential for drug repositioning, that when validated experimentally could serve as starting points in the development of new antibiotics.
Conclusions: We created a highly curated metabolic network model of C. difficile strain 630 and used it to predict essential genes as potential new therapeutic targets in the fight against Clostridium difficile infections.

  • Research Article
  • Cite Count: 48
  • 10.3389/fmicb.2017.02331
A Comprehensive Overview of Online Resources to Identify and Predict Bacterial Essential Genes.
  • Nov 27, 2017
  • Frontiers in Microbiology
  • Chong Peng + 3 more

Genes critical for the survival or reproduction of an organism in certain circumstances are classified as essential genes. Essential genes play a significant role in deciphering the survival mechanism of life, and they have broad applications in pharmaceutics and synthetic biology. The continuous progress of experimental methods for essential gene identification has accelerated the accumulation of gene essentiality data, which facilitates the study of essential genes in silico. In this article, we present some available online resources related to gene essentiality, including bioinformatic software tools for transposon sequencing (Tn-seq) analysis, essential gene databases and online services to predict bacterial essential genes. We review several computational approaches that have been used to predict essential genes, and summarize the features used for gene essentiality prediction. In addition, we evaluate the available online bacterial essential gene prediction servers based on the experimentally validated essential gene sets of 30 bacteria from DEG. This article is intended to be a quick reference guide for microbiologists interested in essential genes.

More from: GigaScience
  • New
  • Research Article
  • 10.1093/gigascience/giaf145
Giant chromosomes of a tiny plant - the complete telomere-to-telomere genome assembly of the simple thalloid liverwort Apopellia endiviifolia (Jungermanniopsida, Marchantiophyta).
  • Nov 29, 2025
  • GigaScience
  • Joanna Szablińska-Piernik + 2 more

  • New
  • Research Article
  • 10.1093/gigascience/giaf146
Segmentation-Based Quality Control of Structural MRI using the CAT12 Toolbox.
  • Nov 29, 2025
  • GigaScience
  • Robert Dahnke + 4 more

  • New
  • Research Article
  • 10.1093/gigascience/giaf144
Cervical Whole Slide Images Dataset for Multi-class Classification.
  • Nov 29, 2025
  • GigaScience
  • Mahnaz Mohammadi + 13 more

  • Research Article
  • 10.1093/gigascience/giaf143
A high-quality chromosome-level genome assembly of the oligophagous fruit fly Bactrocera tsuneonis (Diptera: Tephritidae) and insights into its host specificity.
  • Nov 20, 2025
  • GigaScience
  • Tengda Guo + 5 more

  • Research Article
  • 10.1093/gigascience/giaf142
Chromosome-level assemblies of two hexaploid bamboos Thyrsostachys oliveri and Thyrsostachys siamensis provide a foundation for functional and comparative genomics studies.
  • Nov 17, 2025
  • GigaScience
  • Chaiwat Naktang + 9 more

  • Research Article
  • 10.1093/gigascience/giaf141
pyRootHair: Machine Learning Accelerated Software for High-Throughput Phenotyping of Plant Root Hair Traits.
  • Nov 13, 2025
  • GigaScience
  • Ian Tsang + 5 more

  • Research Article
  • 10.1093/gigascience/giaf133
RNA-SeqEZPZ: A Point-and-Click Pipeline for Comprehensive Transcriptomics Analysis with Interactive Visualizations.
  • Nov 12, 2025
  • GigaScience
  • Cenny Taslim + 4 more

  • Research Article
  • 10.1093/gigascience/giaf140
NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data.
  • Nov 6, 2025
  • GigaScience
  • Fabian Woller + 4 more

  • Research Article
  • 10.1093/gigascience/giaf122
Complete end-to-end learning from protein feature representation to protein interactome inference
  • Nov 6, 2025
  • GigaScience
  • Yu-Hsin Chen + 3 more

  • Supplementary Content
  • 10.1093/gigascience/giaf113
Quantitative detection of DNA methylation from nanopore sequencing data without raw signals
  • Oct 31, 2025
  • GigaScience
  • Zhixing Feng + 4 more
