Nitrogen is essential for life and its transformations are an important part of the global biogeochemical cycle. Being an essential nutrient, nitrogen exists in a range of oxidation states from +5 (nitrate) to -3 (ammonium and amino-nitrogen), and its oxidation and reduction reactions catalyzed by microbial enzymes determine its environmental fate. The functional annotation of the genes encoding the core nitrogen network enzymes has a broad range of applications in metagenomics, agriculture, wastewater treatment and industrial biotechnology. This study developed an alignment-free computational approach to determine the predicted nitrogen biochemical network-related enzymes from the sequence itself. We propose deepNEC, a novel end-to-end feature selection and classification model training approach for nitrogen biochemical network-related enzyme prediction. The algorithm was developed using Deep Learning, a class of machine learning algorithms that uses multiple layers to extract higher-level features from the raw input data. The derived protein sequence is used as an input, extracting sequential and convolutional features from raw encoded protein sequences based on classification rather than traditional alignment-based methods for enzyme prediction. Two large datasets of protein sequences, enzymes and non-enzymes were used to train the models with protein sequence features like amino acid composition, dipeptide composition (DPC), conformation transition and distribution, normalized Moreau-Broto (NMBroto), conjoint and quasi order, etc. The k-fold cross-validation and independent testing were performed to validate our model training. deepNEC uses a four-tier approach for prediction; in the first phase, it will predict a query sequence as enzyme or non-enzyme; in the second phase, it will further predict and classify enzymes into nitrogen biochemical network-related enzymes or non-nitrogen metabolism enzymes; in the third phase, it classifies predicted enzymes into nine nitrogen metabolism classes; and in the fourth phase, it predicts the enzyme commission number out of 20 classes for nitrogen metabolism. Among all, the DPC + NMBroto hybrid feature gave the best prediction performance (accuracy of 96.15% in k-fold training and 93.43% in independent testing) with an Matthews correlation coefficient (0.92 training and 0.87 independent testing) in phase I; phase II (accuracy of 99.71% in k-fold training and 98.30% in independent testing); phase III (overall accuracy of 99.03% in k-fold training and 98.98% in independent testing); phase IV (overall accuracy of 99.05% in k-fold training and 98.18% in independent testing), the DPC feature gave the best prediction performance. We have also implemented a homology-based method to remove false negatives. All the models have been implemented on a web server (prediction tool), which is freely available at http://bioinfo.usu.edu/deepNEC/.
Read full abstract