Abstract

Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method for comparing genome sequences, relies on the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, the k-mer length is often selected rather arbitrarily, even though it is a very important parameter for achieving the appropriate resolution of genome/sequence distances and thus for inferring biologically meaningful phylogenetic relationships. There is therefore a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated the use of KITSUNE for accurate species identification of two de novo assembled bacterial genomes derived from error-prone long-read sequences, and of a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer that accurately identifies viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.

Highlights

  • Genome sequences have been used widely for species identification with high accuracy and have been useful to many research areas in the biotechnological (Costessi et al, 2018), environmental (Vandenkoornhuyse et al, 2010), evolutionary (Bruger and Marx, 2018; Sands, 2019), and clinical sciences (Balloux et al, 2018)

  • The empirically optimal k-mer length was calculated with our three-step approach (Zhang et al, 2017; see Figure 1B): in step (1), we selected the k-mer lengths that gave a cumulative relative entropy (CRE) < 10% of the maximum to define the lower bound of the k-mer length; in step (2), we selected the k-mer lengths that gave an average number of common features (ACF) > 10% of the maximum to define the upper bound of the k-mer length; and in step (3), we selected the k-mer length between the lower and upper bounds that yielded the highest diversity index (H) based on observed common features (OCF)

  • We used a random sampling approach: the iterative calculations across the considered k-mer lengths were repeated several times (N times), each time on a random subset (subsample) of G genomes drawn from all genomes
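
The three-step selection and the random subsampling described in the highlights can be sketched roughly as follows. This is an illustrative sketch only, not KITSUNE's actual code or API: the names `select_k` and `subsample_runs` are hypothetical, the per-k CRE, ACF, and H values are assumed to be computed elsewhere, and reading the "lower bound" as the smallest qualifying k and the "upper bound" as the largest qualifying k is an interpretation of the text.

```python
import random
from typing import Dict, List, Sequence


def select_k(cre: Dict[int, float], acf: Dict[int, float],
             h: Dict[int, float], threshold: float = 0.10) -> int:
    """Pick a k-mer length from precomputed per-k CRE, ACF, and H values.

    All three dicts map a k-mer length to the corresponding metric.
    """
    ks = sorted(cre)

    # Step 1: lower bound = smallest k whose CRE is below 10% of the maximum CRE.
    cre_cut = threshold * max(cre.values())
    lower = min(k for k in ks if cre[k] < cre_cut)

    # Step 2: upper bound = largest k whose ACF is above 10% of the maximum ACF.
    acf_cut = threshold * max(acf.values())
    upper = max(k for k in ks if acf[k] > acf_cut)

    if lower > upper:
        raise ValueError("no k-mer length satisfies both bounds")

    # Step 3: within [lower, upper], keep the k with the highest diversity index H.
    candidates = [k for k in ks if lower <= k <= upper]
    return max(candidates, key=lambda k: h[k])


def subsample_runs(genomes: Sequence[str], g: int, n: int,
                   seed: int = 0) -> List[List[str]]:
    """Draw N random subsets of G genomes each (without replacement per subset)."""
    rng = random.Random(seed)
    return [rng.sample(list(genomes), g) for _ in range(n)]


# Toy metric values, invented purely to exercise the functions.
toy_cre = {5: 1.0, 7: 0.5, 9: 0.08, 11: 0.02, 13: 0.01}
toy_acf = {5: 100.0, 7: 80.0, 9: 40.0, 11: 15.0, 13: 5.0}
toy_h = {5: 0.1, 7: 0.3, 9: 0.6, 11: 0.9, 13: 0.95}
best_k = select_k(toy_cre, toy_acf, toy_h)
runs = subsample_runs(["g1", "g2", "g3", "g4", "g5"], g=3, n=4)
```

With these toy values the lower bound is 9 and the upper bound is 11, so the choice falls to whichever of the two has the higher H. In a real analysis the metrics would be recomputed on each of the N subsamples before the bounds are applied.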


Introduction

Genome sequences have been used widely for species identification with high accuracy and have been useful to many research areas in the biotechnological (Costessi et al, 2018), environmental (Vandenkoornhuyse et al, 2010), evolutionary (Bruger and Marx, 2018; Sands, 2019), and clinical sciences (Balloux et al, 2018). The enormous amount of data generated by sequencing has made it challenging to compare sequences with alignment-based approaches such as BLAST (Altschul et al, 1990). The alignment-based approach generally requires significant memory and is time-consuming, making the comparison of multi-genome-scale sequence data infeasible. Alignment-free methods for biological sequence analysis have been developed and perform well for comparative genomics and metagenomics, while being less time-consuming than alignment-based methods (Ren et al, 2018). K-mer length is a very important parameter in alignment-free phylogenetic inference (Bernard et al, 2019).
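
To illustrate why k-mer length matters, the sketch below counts k-mers in two short toy sequences and shows how the number of shared k-mers shrinks as k grows. It is a minimal illustration under simplifying assumptions: the function names are hypothetical, the sequences are invented, reverse complements are ignored, and none of this is taken from KITSUNE itself.

```python
from collections import Counter
from typing import Set


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def common_kmers(a: str, b: str, k: int) -> Set[str]:
    """Return the set of k-mers that occur in both sequences."""
    return set(kmer_counts(a, k)) & set(kmer_counts(b, k))


# Two toy sequences differing at a single position.
seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"
shared_by_k = {k: len(common_kmers(seq_a, seq_b, k)) for k in (2, 4, 6)}
```

For these two sequences, three 2-mers are shared, one 4-mer is shared, and no 6-mer is shared. Short k-mers are common to nearly everything and carry little signal, whereas long k-mers are so specific that even closely related sequences share few of them.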
