Machine learning on normalized protein sequences.

Dominik Heider, Jens Verheyen, Daniel Hoffmann

doi:10.1186/1756-0500-4-94

Open Access

https://doi.org/10.1186/1756-0500-4-94

Abstract

Background: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths.

Findings: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%.

Conclusions: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
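The length normalization described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each residue is first mapped to a numeric descriptor (the Kyte-Doolittle hydropathy scale is used here only as a stand-in), and the names `encode` and `normalize_length` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example descriptor
# (assumption: the choice of descriptor is illustrative).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each amino acid letter to its numeric descriptor value."""
    return [HYDROPATHY[residue] for residue in sequence.upper()]


def normalize_length(values, target_length):
    """Linearly interpolate `values` onto `target_length` evenly spaced points.

    The first and last values are preserved; everything in between is a
    weighted average of the two nearest original positions.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot normalize an empty sequence")
    if target_length < 2:
        raise ValueError("target_length must be at least 2")
    if n == 1:
        return [values[0]] * target_length

    step = (n - 1) / (target_length - 1)  # spacing in original coordinates
    result = []
    for j in range(target_length):
        x = j * step
        i = min(int(x), n - 2)  # left neighbour, clamped at the last interval
        frac = x - i            # fractional distance towards the right neighbour
        result.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return result


# Two sequences of different length become vectors of the same length.
short_vec = normalize_length(encode("ACDEFGHIK"), 20)
long_vec = normalize_length(encode("ACDEFGHIKLMNPQRSTVWYACDEFGHIK"), 20)
# both vectors now have 20 entries
```

Vectors of uniform length like these can then be passed to any fixed-input classifier; the abstract reports using random forests for that step.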

Highlights

Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes.
We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and is a promising alternative to existing methods, especially for protein sequences of variable length.
The relative performance of the normalization procedures is quite similar across all descriptors.

Summary

Introduction

Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. Because deletions and insertions are frequent in biological sequences, these methods must cope with varying sequence lengths. One solution is the use of kernel functions: a kernel function returns the inner product between the mapped data points in a higher dimensional space, and the special class of string kernels tries to match alignments of subsequences to build a higher dimensional feature space in which the sequences can be separated [13]. Another possible solution is the application of multiple sequence alignments [14] or multiple pairwise alignments to a reference sequence [15]. Alignment, however, introduces some artificial information that can bias predictions.

Methods

Results

Conclusion

Journal: BMC Research Notes
Publication Date: Mar 31, 2011
Citations: 69
License type: CC BY 2.0
