Abstract

We present a fast, robust and parsimonious approach to detecting signals in an ordered sequence of numbers. Our motivation is the need for a method that takes a sequence of scores corresponding to properties of positions in virus genomes and finds outlying regions of low scores. Suitable statistical methods that avoid complex models and numerous assumptions are surprisingly lacking. We resolve this by developing a method that detects regions of low score within sequences of real numbers. The method makes no a priori assumptions about the length of such a region; it gives the explicit location of the region and scores it statistically. Because it does not use detailed mechanistic models, the method is fast and will be useful in a wide range of applications. We present our approach in detail and test it on simulated sequences. We show that it is robust to a wide range of signal morphologies and that it is able to capture multiple signals in the same sequence. Finally, we apply it to viral genomic data to identify regions of evolutionary conservation within influenza and rotavirus.

Highlights

  • In this paper, we present a new method for detecting signals in an ordered sequence of numbers

  • We have presented a new method for finding signals

  • For datasets that are relatively small, say hundreds or thousands of values in length, as with the viral sequences that motivated this work, our method is computationally inexpensive for a single application, and we have presented approaches for speeding up the computation that will be of use when there are many or longer sequences to be analysed

Summary

Introduction

We present a new method for detecting signals in an ordered sequence of numbers. A signal is a run in the sequence where the values tend to be unusually low. The development of this method was motivated by the need to identify ‘regions of interest’ in viral genomic data, for which methods are already established for assigning a score to individual codon positions [1]. Such scores indicate high or low variation between aligned sequences (equivalently, low or high conservation between sequences), while taking into account codon usage across the genome and amino acid usage per site. Such regions could represent the shadow of some additional feature that must be relatively conserved through evolution, such as a cis-acting signal, or could indicate some sequence-dependent element in an alternative reading frame.
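To make the notion of a signal concrete, the following is a minimal illustrative sketch and not the method developed in this paper. It assumes a signal can be approximated as the contiguous run whose scores, centred on the sequence median, have the smallest sum, and that the run can be scored against random shufflings of the same values. The function names, the choice of the median as the centre and the permutation scheme are all assumptions made for illustration.

```python
import random
import statistics


def lowest_scoring_run(scores):
    """Return (start, end, total) for the contiguous run whose
    median-centred scores have the smallest sum; end is exclusive.
    Illustrative only: centring on the median is an assumption."""
    centre = statistics.median(scores)
    best_total, best_start, best_end = float("inf"), 0, 0
    run_total, run_start = 0.0, 0
    for i, score in enumerate(scores):
        # A run with a non-negative total cannot help a minimum, so restart.
        if run_total >= 0:
            run_total, run_start = 0.0, i
        run_total += score - centre
        if run_total < best_total:
            best_total, best_start, best_end = run_total, run_start, i + 1
    return best_start, best_end, best_total


def permutation_p_value(scores, n_perm=999, seed=0):
    """Estimate how often a shuffled sequence contains a run at least
    as low as the observed one (hypothetical scoring scheme)."""
    rng = random.Random(seed)
    observed = lowest_scoring_run(scores)[2]
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if lowest_scoring_run(shuffled)[2] <= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Toy sequence with a short run of low values at positions 5 to 7.
example = [5, 6, 5, 7, 6, 1, 0, 1, 6, 5, 7, 6, 5]
region = lowest_scoring_run(example)  # (5, 8, -13.0)
p_value = permutation_p_value(example)
```

Like the approach described above, this sketch fixes no region length in advance and returns an explicit location together with a statistical score. It finds only a single region and runs in time linear in the sequence length per scan.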

