Abstract

Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5′ UTR, intron, and 3′ UTR regions from 58 insect species belonging to three genera of Diptera that include Anopheles, Drosophila, and Glossina. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7–9 bp for the whole genome, 5′ and 3′ UTR regions as well as the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together quite visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated tightly with each other as compared to other genera. The genomes of two Aedes and one Culex species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature for each of the 58 Dipteran species by making all-to-all pairwise comparison of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.

Highlights

  • DNA k-mers are short recurring elements in the genomes of all living species. These elements are both conserved and diverged across species owing to their functional significance, which makes these k-mer signatures ideal for species identification

  • Summary statistics for all k-mer lengths are available in Table 3. The number of k-mers common to all twelve species is depicted in Figure 8 (37, 344, and 1890 for motif lengths of 7, 8, and 9 bp, respectively) and is listed in Supplementary File 8, where the CC matrix is available for k-mer lengths of 7, 8, and 9 bp

Summary

Introduction

DNA k-mers are short recurring elements in the genomes of all living species. These elements are both conserved and diverged across species owing to their functional significance, which makes k-mer signatures ideal for species identification. Several recent studies have described the distribution of statistically significant k-mers in the genomes and several regulatory subregions (core, proximal, and distal promoters, and 3′ and 5′ UTRs) of a small number of plant species as well as modern and archaic humans [1–3]. K-mers can be part of the core segments of transcription factor binding sites or regulatory elements that take part in protein binding and gene regulation in different subregions of the genome. The present version of the algorithm is an alignment-free k-mer sequence comparison method. Such methods involve statistical analysis and comparison of k-mers between the genomes of two species. These methods vary in the statistical measures applied, such as the comparison of word frequency, incorporation of information theory, universal sequence maps, and the measurement of complexity [4]. The advantages of k-mer-based alignment-free methods over alignment-based phylogenetic algorithms are that they process the data much faster and avoid biases that can be introduced by a priori-defined guide trees used during alignment and by the subjective selection of alignment scoring parameters, such as gap opening and extension [5, 6].
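As a rough illustration of the word-frequency flavor of alignment-free comparison mentioned above, the following Python sketch counts overlapping k-mers in two sequences and correlates their relative frequencies. It is not the scoring and normalization scheme developed in this study, which is not detailed in this excerpt; the use of Pearson correlation, the skipping of windows containing ambiguous bases, and all function names (`kmer_counts`, `frequency_vector`, `kmer_correlation`) are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer that consists only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # assumption: skip windows with N etc.
            counts[word] += 1
    return counts


def frequency_vector(counts, k):
    """Relative frequency of all 4**k possible k-mers, in a fixed order."""
    total = sum(counts.values())
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def kmer_correlation(seq_a, seq_b, k):
    """Correlate the k-mer frequency profiles of two sequences."""
    vec_a = frequency_vector(kmer_counts(seq_a, k), k)
    vec_b = frequency_vector(kmer_counts(seq_b, k), k)
    return pearson(vec_a, vec_b)
```

Calling `kmer_correlation(genome_a, genome_b, 8)` on two sequences held as strings returns a value between -1 and 1. For the k-mer lengths used in the study (7 to 9 bp) each frequency vector has at most 4**9 = 262,144 entries, so the comparison needs no alignment and scales linearly with genome length.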

