Abstract
Protein coding regions prediction is a very important but overlooked subtask for tasks such as prediction of complete gene structure, coding/noncoding RNA. Many machine learning methods have been proposed for this problem, they first encode a biological sequence into numerical values and then feed them into a classifier for final prediction. However, encoding schemes directly influence the classifier's capability to capture coding features and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a protein coding region prediction method in transcript sequences based on a bidirectional recurrent neural network with non-overlapping 3-mer feature, and achieved considerable improvement over existing methods, but there is still much room to improve the performance. First, 3-mer feature that counts the occurrence frequency of trinucleotides in a biological sequence only reflects local sequence order information between the most contiguous nucleotides, which loses almost all the global sequence order information. Second, kmer features of length k larger than three (e.g., hexamer) may also contain useful information. Based on the two points, we here present a deep learning framework with hybrid encoding for protein coding regions prediction in biological sequences, which effectively exploit global sequence order information, non-overlapping gapped kmer (gkm) features and statistical dependencies among coding labels. 3-fold cross-validation tests on human and mouse biological sequence demonstrate that our proposed method significantly outperforms existing state-of-the-art methods.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.