Feature selection for gene prediction in metagenomic fragments

Amani Al-Ajlan, Achraf El Allali

doi:10.1186/s13040-018-0170-z

Abstract

Background: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences.

Results: In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probabilities of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify candidate genes based on the GC content of their read.

Conclusion: Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
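The mRMR step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes already-discretized feature values, uses the mutual information difference form of the criterion (relevance minus mean redundancy), and the function names `mutual_information` and `mrmr_select` as well as the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedily pick k feature indices by relevance minus mean redundancy.

    columns: list of feature columns (assumed discrete for this sketch).
    labels:  class labels, e.g. 1 = coding, 0 = non-coding.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): feature 1 duplicates feature 0, feature 2 is
# weaker but carries information that feature 0 does not.
informative = [0, 0, 0, 0, 1, 1, 1, 1]
duplicate = list(informative)
weak = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

print(mrmr_select([informative, duplicate, weak], labels, 2))  # prints [0, 2]
```

In this toy case the duplicate feature is skipped even though it is as relevant as the first one, because its redundancy with the already-selected feature outweighs its relevance.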

Highlights

Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction.
Performance measures: Gene prediction performance is measured by comparing the model prediction with the true gene annotation in fragments that were obtained from GenBank [25].
We count the number of true positives, false positives, and false negatives.
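As an illustration of how such counts are typically turned into scores, the sketch below derives sensitivity, precision (often called specificity in gene prediction papers), and their harmonic mean. The text above only states that the three counts are taken, so these derived measures, the exact-match rule used for counting, and the names `count_matches` and `prediction_measures` are assumptions made for this example.

```python
def count_matches(predicted, annotated):
    """Count TP, FP, FN, treating a prediction as correct only if it exactly
    matches an annotated gene (an assumed, simplified matching rule).

    Genes are hypothetical (start, end, strand) tuples.
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    return tp, fp, fn


def prediction_measures(tp, fp, fn):
    """Derive sensitivity, precision, and their harmonic mean from counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + precision
    harmonic_mean = 2 * sensitivity * precision / total if total else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "harmonic_mean": harmonic_mean,
    }


# Hypothetical example: two of three predictions match the annotation.
predicted_genes = [(1, 300, "+"), (350, 650, "-"), (700, 900, "+")]
annotated_genes = [(1, 300, "+"), (350, 650, "-"), (920, 1200, "+")]
tp, fp, fn = count_matches(predicted_genes, annotated_genes)
print(tp, fp, fn)
print(prediction_measures(tp, fp, fn))
```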

Summary

Introduction

Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. [...] studies identified genes through reliable experiments on living cells and organisms, which is usually an expensive and time-consuming task [8]. Content-based methods try to overcome these limitations by using statistical approaches to detect variations between coding and non-coding regions [1, 8]. While these approaches are very successful on genomic sequences, there is still work to be done for metagenomics due to the nature of the data [6, 12]. The greatest challenges for gene prediction algorithms in metagenomics are the short read length and the incomplete and fragmented nature of the data [1, 13].

Al-Ajlan and El Allali, BioData Mining (2018) 11:9



Journal: BioData Mining
Publication Date: Jun 7, 2018
Citations: 12
License type: open-access


Similar Papers

The Effect of Machine Learning Algorithms on Metagenomics Gene Prediction
Amani Al-Ajlan ... Achraf El Allali
27 Dec 2018

The SVM Binary Tree Classification Using MRMR and F-Score Feature Selection Algorithms
Jozef Vavrek ... Jozef Juhár
Acta Electrotechnica et Informatica | VOL. 14
01 Jun 2014

Minimum Redundancy Feature Selection from Microarray Gene Expression Data
Chris Ding ... Hanchuan Peng
Journal of Bioinformatics and Computational Biology | VOL. 03
01 Apr 2005

Analysis of the influence of Minimum Redundancy Maximum Relevance as dimensionality reduction method on cancer classification based on microarray data using Support Vector Machine classifier
Firda Aminy Ma’Ruf ... Untari Novia Wisesty
Journal of Physics: Conference Series | VOL. 1192
01 Mar 2019
