Abstract

In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.

Highlights

  • In recent years, k-mer-based analysis and comparison methods have become standard tools for the analysis of large DNA sequences such as chromosomes, whole genomes, or even metagenomes

  • The big advantage of k-mer-based methods compared to alignment-based methods such as the well-established

  • The result of a standard k-mer analysis is mostly only comprised of a number representing the similarity between each pair of sequences

Read more

Summary

Introduction

K-mer-based analysis and comparison methods have become standard tools for the analysis of large DNA sequences such as chromosomes, whole genomes, or even metagenomes. There are cases where they are still very unsatisfying when compared with those of other methods, for example, during the determination of the phylogenetic distance between two genomes, where the alignment of short DNA motifs such as highly conserved ribosomal RNA genes delivers very reliable results, while the results of k-mer methods are often uncertain [3]. Perhaps the main difference between the results of an alignment-based and k-mer-based method when used on the same data set (e.g., comparison of two genome sequences), is that the results of the alignment method can include the exact position (in bp) and quality of similarity of every part of the sequence within the data set. The result of a standard k-mer analysis is mostly only comprised of a number representing the similarity between each pair of sequences

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.