Abstract

Background

The accurate prediction of the initiation of translation in mRNA sequences is an important activity for genome annotation. However, obtaining an accurate prediction is not always simple. The task can be modeled as a classification problem between positive sequences (protein-coding) and negative sequences (non-coding). The problem is highly imbalanced because each mRNA molecule has a single translation initiation site and many other candidate sites that are not initiators. This study therefore approaches the problem from the perspective of class balancing and presents M-Clus, an undersampling balancing method based on clustering. The methodology also adds features to the sequences and improves classifier performance by including knowledge obtained by the model, an approach called InAKnow.

Results

With this methodology, the performance measures used (accuracy, sensitivity, specificity and adjusted accuracy) exceed 93% for Mus musculus and Rattus norvegicus, and range between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens and Nasonia vitripennis. When the knowledge obtained by the model is included, precision increases significantly, by 39% for Mus musculus and by 22.9% for Rattus norvegicus. For the other organisms, precision increases by between 37.10% and 59.49%. Including certain features during training, for example the presence of ATG in the region upstream of the Translation Initiation Site, improves sensitivity by approximately 7%. The M-Clus balancing method significantly increases sensitivity, from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus).

Conclusions

The results indicate that the methodology proposed in this work is adequate for TIS prediction, particularly when using the concept of acquired knowledge, which increased accuracy in all databases evaluated.
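The performance measures named in the abstract can all be derived from the four counts of a confusion matrix. The sketch below is illustrative only: the function name and the counts are hypothetical, and it assumes that adjusted accuracy is the mean of sensitivity and specificity, a common definition that this abstract does not spell out.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute binary-classification measures from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # share of true TIS that are found
    specificity = ratio(tn, tn + fp)   # share of non-TIS that are rejected
    precision = ratio(tp, tp + fp)     # share of predicted TIS that are real
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    # Assumption: adjusted accuracy = (sensitivity + specificity) / 2.
    adjusted_accuracy = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "adjusted_accuracy": adjusted_accuracy,
    }


# Hypothetical, highly imbalanced test set: 100 true TIS, 10,000 non-TIS.
metrics = classification_metrics(tp=90, fp=500, tn=9500, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity is 0.90 and specificity is 0.95, yet precision is only about 0.15, because even a small false-positive rate on a large negative class produces many false positives. This is one reason precision is worth reporting separately on imbalanced data.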

Highlights

  • The accurate prediction of the initiation of translation in sequences of mRNA is an important activity for genome annotation

  • The problem of predicting the Translation Initiation Site (TIS) is highly imbalanced, and the oversampling methods already used in this context significantly increase computational complexity; this study therefore proposes an undersampling class-balancing method, M-Clus

  • Since the proposed method requires a large amount of testing, it was initially tested with the smaller databases, Mus musculus and Rattus norvegicus, and then expanded to organisms that have a larger amount of mRNA: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens
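The highlights describe M-Clus only as an undersampling method based on clustering; its actual steps are not given here. As a rough illustration of the general idea, and not of M-Clus itself, the following sketch clusters the majority (negative) class with a small k-means and keeps one real sample per cluster. All names, the choice of k-means, and the toy feature vectors are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Keep the majority sample closest to each of n_keep centroids.

    May return fewer than n_keep samples if two centroids share
    the same nearest sample.
    """
    k = min(n_keep, len(majority))
    if k <= 0:
        return []
    kept = []
    for centroid in kmeans(majority, k, seed=seed):
        representative = min(majority,
                             key=lambda p: squared_distance(p, centroid))
        if representative not in kept:
            kept.append(representative)
    return kept


# Toy numeric features (hypothetical): six negatives, two positives.
negatives = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
             (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
positives = [(2.0, 2.0), (2.1, 1.9)]
balanced_negatives = cluster_undersample(negatives, n_keep=len(positives))
print(balanced_negatives)
```

Unlike oversampling, which grows the training set, this kind of approach shrinks the majority class to roughly the size of the minority class, so the classifier trains on fewer examples while the retained negatives still come from different regions of the feature space.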


Summary

Introduction

The accurate prediction of the initiation of translation in mRNA sequences is an important activity for genome annotation. Obtaining an accurate prediction is not always simple, and the task can be modeled as a classification problem between positive sequences (protein-coding) and negative sequences (non-coding). The problem is highly imbalanced because each mRNA molecule has a single translation initiation site and many other candidate sites that are not initiators. Given an mRNA molecule, a central problem of molecular biology is to determine whether it contains a coding sequence (CDS) and then to discover which protein it encodes. The region of the mRNA sequence where protein synthesis begins is called the Translation Initiation Site (TIS). Control of the initiation of translation is one of the most important processes in the regulation of gene expression [3]. Highly accurate prediction could be useful for a better understanding of the protein obtained from the nucleotide sequences [4].
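To make the imbalance concrete, the sketch below lists every ATG in a sequence as a candidate site, labels the one annotated as the TIS, and records how many ATGs lie upstream of each candidate. The sequence, the annotated position, and the function and field names are all hypothetical; the upstream count is only a simple stand-in for the kind of upstream-ATG feature the abstract mentions.

```python
def candidate_sites(sequence, true_tis):
    """List every ATG in `sequence` with a label and one simple feature.

    `true_tis` is the 0-based index of the annotated start codon.
    """
    sequence = sequence.upper()
    sites = []
    upstream_atgs = 0
    position = sequence.find("ATG")
    while position != -1:
        sites.append({
            "position": position,
            "is_tis": position == true_tis,       # positive example
            "upstream_atg_count": upstream_atgs,  # ATGs before this one
        })
        upstream_atgs += 1
        position = sequence.find("ATG", position + 1)
    return sites


# Made-up sequence; index 10 is assumed to be the annotated TIS.
mrna = "GCATGCCACCATGGCTATGAAATGA"
sites = candidate_sites(mrna, true_tis=10)

positives = sum(site["is_tis"] for site in sites)
negatives = len(sites) - positives
print(f"{positives} positive vs {negatives} negative candidates")
```

Even this short sequence yields one positive and three negatives; real transcripts are far longer, so negatives outnumber positives by a much wider margin.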
