Abstract

Essential genes carry genomic information that is central to a comprehensive understanding of life and evolution, and their study is therefore regarded as an important problem in computational biology. Computational methods for identifying essential genes have become increasingly popular because they reduce the cost and time of traditional experiments. A few models have addressed this problem, but their performance remains unsatisfactory because of high-dimensional features and reliance on traditional machine learning algorithms. A new model is therefore needed to improve the prediction of essential genes from DNA sequence features. This study applied a natural language processing (NLP) model to biological sequences by treating them as natural language words. An ensemble deep neural network was then trained in a supervised manner on the resulting NLP features. The proposed method identified essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. It outperformed the single models without ensembling, as well as the state-of-the-art predictors on the same benchmark dataset. These results indicate the effectiveness of the proposed method for determining essential genes in particular, and its potential for other sequence-based problems in general.
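For readers less familiar with the threshold-based metrics reported above, the following minimal sketch shows how sensitivity, specificity, accuracy, and MCC are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage example are hypothetical and chosen only for illustration; they are not the study's data. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fn: essential genes predicted as essential / non-essential.
    tn/fp: non-essential genes predicted as non-essential / essential.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # MCC denominator; defined as 0 here when any marginal total is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not taken from the paper).
example = binary_metrics(tp=60, fp=15, tn=85, fn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it is less sensitive to class imbalance than accuracy, which is one reason it is commonly reported alongside sensitivity and specificity.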

Highlights

  • Essential genes and their encoded proteins are regarded as the bases of life because they are indispensable for the survival of organisms [1]

  • We addressed the above-mentioned problems on a benchmark Archaea dataset by extracting intrinsic features from DNA sequences (biological sub-word features adapted from natural language processing (NLP) techniques [31,32,33]) and then training an ensemble learning-based classifier to make prediction more accurate and faster

  • Because our model combines different algorithms from NLP, machine learning, and deep learning, we ran preliminary experiments on our data to find the optimal parameters of each model
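The idea of treating a DNA sequence as natural language words with sub-word features can be illustrated with a short sketch. This is an assumption-laden illustration, not the authors' pipeline: the function names `kmer_words` and `char_ngrams`, the k-mer length, the n-gram range, and the `<`/`>` boundary markers (a convention used by some sub-word embedding models) are all hypothetical choices made here.

```python
def kmer_words(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers, treated as 'words'."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def char_ngrams(word, n_min=2, n_max=3):
    """List the character n-grams ('sub-words') of one word.

    The word is wrapped in '<' and '>' so that n-grams at the start and
    end of the word are distinguishable from those in the middle.
    """
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


# Hypothetical sequence fragment for illustration only.
words = kmer_words("ATGCGT", k=3)
print(words)
print(char_ngrams(words[0]))
```

In a sub-word embedding model, each word vector is typically built from the vectors of its n-grams, so sequences that share short motifs end up with related representations even when the full words differ.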


Summary

Introduction

Essential genes and their encoded proteins are regarded as the basis of life because they are indispensable for the survival of organisms [1]. A comprehensive understanding of essential genes can enable scientists to elucidate the biological nature of microorganisms [4], produce minimal gene subsets [5], develop promising drug targets, and generate potential medicines to combat infectious diseases [6]. Because of their importance, studies of essential genes are considered crucial in genomics and bioinformatics. Several experimental techniques, such as single-gene knockout [7], conditional gene knockout [8], transposon mutagenesis [9], and RNA interference [10], have been developed to characterize essential genes in microorganisms. Although these experimental methods have many advantages and are relatively reliable, they remain costly and time-consuming.

