KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis.

Natapol Pornputtapong, Intawat Nookaew, Nipa Chokesajjawatee, Se-Ran Jun, Piroon Jenjaroenpun, Daniel A Acheampong, Preecha Patumcharoenpol, Thidathip Wongsurawat, Suganya Yongkiettrakul

doi:10.3389/fbioe.2020.556413

Abstract

Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method for comparing genome sequences, relies on the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, the choice of k-mer length is often rather arbitrary, even though it is a very important parameter for achieving the resolution of genome/sequence distances needed to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated the use of KITSUNE for accurate species identification of two de novo assembled bacterial genomes derived from error-prone long-read sequences and of a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer that accurately identifies viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
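To make the notion of a k-mer distribution concrete, the following is a minimal sketch of counting overlapping k-mers in a single sequence. It is an illustration only and not KITSUNE's implementation: the function names `count_kmers` and `kmer_frequencies` are hypothetical, only the forward strand is counted, and windows containing ambiguous bases are skipped by assumption.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand of a DNA sequence.

    Hypothetical helper for illustration; real tools typically use a
    dedicated k-mer counter and may merge reverse complements.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    valid_bases = set("ACGT")
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: skip windows that contain ambiguous bases such as N.
        if set(kmer) <= valid_bases:
            counts[kmer] += 1
    return counts


def kmer_frequencies(sequence: str, k: int) -> dict:
    """Convert raw k-mer counts into relative frequencies that sum to 1."""
    counts = count_kmers(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    for kmer, n in sorted(count_kmers("ACGTACGT", 3).items()):
        print(kmer, n)
```

Comparing such frequency profiles between genomes, rather than aligning the genomes themselves, is the general idea behind alignment-free analysis; the value of k controls how specific each feature is.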

Highlights

Genome sequences have been used widely for species identification with high accuracy and have been useful to many research areas in the biotechnological (Costessi et al, 2018), environmental (Vandenkoornhuyse et al, 2010), evolutionary (Bruger and Marx, 2018; Sands, 2019), and clinical sciences (Balloux et al, 2018).
The empirically optimal k-mer length was calculated using our three-step approach (Zhang et al, 2017; see Figure 1B): in step (1), we selected the k-mer lengths that gave a cumulative relative entropy (CRE) < 10% of the maximum to define the lower bound of k-mer length; in step (2), we selected the k-mer lengths that gave an average number of common features (ACF) > 10% of the maximum to define the upper bound of k-mer length; and in step (3), we selected the k-mer length, between these lower and upper bounds, that yielded the highest diversity index (H) based on observed common features (OCF).
We used a random sampling approach to perform the iterative calculations across the considered k-mer lengths on subsamples of all genomes (G genomes per subsample), repeated several times (N times).
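The three-step selection and the random subsampling described in these highlights can be sketched as follows. This is a hedged illustration, not KITSUNE's code: it assumes the CRE, ACF, and diversity index (H) values have already been computed for each candidate k and are supplied as dictionaries, it interprets the lower bound as the smallest k whose CRE falls below 10% of the maximum and the upper bound as the largest k whose ACF exceeds 10% of the maximum, and the names `select_k` and `draw_subsamples` are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence


def select_k(cre: Dict[int, float],
             acf: Dict[int, float],
             h: Dict[int, float],
             threshold: float = 0.10) -> int:
    """Pick a k-mer length from precomputed per-k metrics (illustrative only)."""
    ks = sorted(set(cre) & set(acf) & set(h))
    if not ks:
        raise ValueError("no k-mer length has all three metrics")
    cre_max = max(cre[k] for k in ks)
    acf_max = max(acf[k] for k in ks)
    # Step 1: lengths with CRE below 10% of the maximum; the smallest one
    # is taken as the lower bound (an assumption about the wording).
    below = [k for k in ks if cre[k] < threshold * cre_max]
    # Step 2: lengths with ACF above 10% of the maximum; the largest one
    # is taken as the upper bound (also an assumption).
    above = [k for k in ks if acf[k] > threshold * acf_max]
    if not below or not above:
        raise ValueError("thresholds leave no candidate k-mer length")
    lower, upper = min(below), max(above)
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    # Step 3: within the bounds, keep the length with the highest H.
    window = [k for k in ks if lower <= k <= upper]
    return max(window, key=lambda k: h[k])


def draw_subsamples(genomes: Sequence[str], g: int, n: int,
                    seed: Optional[int] = None) -> List[List[str]]:
    """Draw n random subsamples of g genomes each, without replacement."""
    rng = random.Random(seed)
    return [rng.sample(list(genomes), g) for _ in range(n)]


if __name__ == "__main__":
    # Toy numbers chosen for illustration; they are not results from the paper.
    cre = {5: 100.0, 7: 40.0, 9: 8.0, 11: 2.0, 13: 0.5}
    acf = {5: 1000.0, 7: 900.0, 9: 500.0, 11: 150.0, 13: 20.0}
    h = {5: 1.0, 7: 2.0, 9: 3.0, 11: 3.5, 13: 4.0}
    print(select_k(cre, acf, h))
    print(draw_subsamples(["g1", "g2", "g3", "g4", "g5"], g=3, n=2, seed=0))
```

In a full workflow, the metrics would be computed on each subsample returned by `draw_subsamples` and then summarized across the N rounds before applying `select_k`; how that summary is done is not specified in the text above.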

Summary

Introduction

Genome sequences have been used widely for species identification with high accuracy and have been useful to many research areas in the biotechnological (Costessi et al, 2018), environmental (Vandenkoornhuyse et al, 2010), evolutionary (Bruger and Marx, 2018; Sands, 2019), and clinical sciences (Balloux et al, 2018). The enormous amount of data generated by sequencing has made it challenging to compare sequences with alignment-based approaches such as BLAST (Altschul et al, 1990). Alignment-based approaches generally require significant memory and are time-consuming, making the comparison of multi-genome-scale sequence data infeasible. Alignment-free methods for biological sequence analysis have been developed and perform well for comparative genomics and metagenomics, while being less time-consuming than alignment-based methods (Ren et al, 2018). K-mer length is a very important parameter in alignment-free phylogenetic inference (Bernard et al, 2019).
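As a small illustration of why k-mer length matters in alignment-free comparison, the sketch below computes a Jaccard distance between the k-mer sets of two sequences at several values of k. The Jaccard distance is used here only as a simple, well-known example; it is not claimed to be the distance KITSUNE uses, and the function names and toy sequences are hypothetical.

```python
def kmer_set(sequence: str, k: int) -> set:
    """Return the set of distinct overlapping k-mers in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a: str, seq_b: str, k: int) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the two k-mer sets (illustrative metric)."""
    set_a, set_b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = set_a | set_b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Two toy sequences that differ at a single position.
    a = "ACGTACGTAC"
    b = "ACGTTCGTAC"
    for k in (1, 2, 5):
        print(k, round(jaccard_distance(a, b, k), 3))
```

With very short k-mers nearly every feature is shared and the two sequences look identical, while with long k-mers a single difference removes most shared features. A useful k lies between these extremes, which is the trade-off that a systematic choice of k-mer length is meant to address.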

Journal: Frontiers in Bioengineering and Biotechnology	Publication Date: Sep 23, 2020
Citations: 16	License type: CC BY 4.0
