Abstract

Machine learning algorithms (MLAs) have recently been applied to predict gene mutations of Escherichia coli (E. coli) under different exposure conditions, with room for improvement in performance. In a bid to improve performance, we hypothesize that incorporating the interactions between genes will help MLAs make better predictions. To investigate this, we integrated protein-coding gene cofunctional networks into a mutation dataset of E. coli exposed to different conditions. Also, we proposed a feature-selection algorithm based on gene cofunctional networks to pick the most relevant exposure conditions. Then, we used the extended dataset to train a support vector classifier, an artificial neural network, and an ensemble of both MLAs. Separate models were trained for each of the protein-coding genes. Validation results showed that our approach improved both the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision-recall curve (AUPRC). A peak increase of 8.20% in AUPRC was observed. A similar analysis on selected genes, with ten or more mutation points for each gene, also showed improvement in the general performance of the MLAs. Out-of-sample testing on adaptive laboratory evolution experiments curated from the literature provided further evidence of an enhanced mutation-prediction performance, where a maximum 8.74% boost in the AUC was observed. Finally, we highlighted the genes with the most improved and most degraded predictions due to the additional information of the cofunctional genes. This work suggests that the functional relationship between genes may play a role in gene mutation and illustrates how the relationships might help to improve mutation prediction.

Highlights

  • We propose a feature-selection process based on the feature significance of the exposure-condition features of each gene and its cofunctional genes

  • We evaluate our models on five adaptive laboratory evolution (ALE) experiments curated from the literature, and present the analysis of the results obtained

  • From the AUC and area under the precision-recall curve (AUPRC) of both the validation process and the out-of-sample testing, we subsequently investigate the impact of our feature selection and the expanded dataset
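The two metrics named in the highlights can be illustrated with a minimal sketch. It assumes binary labels (1 = gene mutated, 0 = not mutated) and a real-valued model score per sample. The function names and the toy data are hypothetical, and average precision is used here as one common estimator of AUPRC; the source does not say which estimator the authors used.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Tied scores count as half a win (Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Assumes scores are distinct; tied scores would need to be grouped.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Walk the samples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example (made-up values, not data from the paper).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))            # 0.75
print(average_precision(y_true, scores))  # about 0.833
```

AUPRC is usually the more informative of the two when mutated samples are rare, because it ignores true negatives.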


Summary

INTRODUCTION

This work focuses on training machine learning models on mutation data derived from state-of-the-art genomic sequencing pipelines, with the aim of predicting mutations under novel exposure conditions. MLAs are good candidates for learning and predicting complex processes and have been applied to predict various biological phenomena. These encompass investigations into the relatedness of genes, long non-coding RNAs (lncRNAs), and microRNAs (miRNAs) [4], [5]; predicting phenotypic characteristics and environmental conditions from gene expression profiles [6]; antibiotic resistance acquisition [7]; analysis of cancers, survival outcomes, and disease-pathway associations [8]–[10]; diagnosis of mutations in epidermal growth factors [11]; and classifying protein binding activity to DNA [12], [13]. We describe the feature-selection process and the oversampling procedure used.

FEATURE SELECTION
17: Add FS^i_selected to FS_selected
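The accumulation step in the feature-selection procedure can be sketched as follows. This is one plausible reading: the features chosen for each gene i (the target gene and its cofunctional genes) are merged into a running selected set. The significance scores, the 0.05 threshold, the gene names, and the function name are all hypothetical placeholders, since the source does not specify how feature significance is computed.

```python
def select_features(gene, cofunctional, significance, threshold=0.05):
    """Union of exposure-condition features significant for `gene`
    or for any of its cofunctional genes.

    significance[g][f] is an assumed p-value-like score for feature f
    on gene g; smaller means more significant.
    """
    selected = set()
    for g in [gene, *cofunctional.get(gene, [])]:
        # Features picked for this individual gene.
        picked = {f for f, p in significance.get(g, {}).items() if p <= threshold}
        # Add this gene's picks to the running selection.
        selected |= picked
    return sorted(selected)


# Toy inputs (made-up genes and scores, for illustration only).
cofunctional = {"geneA": ["geneB", "geneC"]}
significance = {
    "geneA": {"antibiotic": 0.01, "temperature": 0.40},
    "geneB": {"temperature": 0.03, "carbon_source": 0.20},
    "geneC": {"carbon_source": 0.60},
}
print(select_features("geneA", cofunctional, significance))
# ['antibiotic', 'temperature']
```

In this toy case, "temperature" is not significant for geneA on its own but is kept because it is significant for the cofunctional geneB.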
MODEL DESCRIPTION
ENSEMBLE MODEL
MODEL TRAINING AND TESTING
RESULTS AND DISCUSSION
PERFORMANCE METRICS
VALIDATION RESULTS
CONCLUSION