KGCAK: a K-mer based database for genome-wide phylogeny and complexity evaluation.

Dapeng Wang, Jun Yu, Jiayue Xu

doi:10.1186/s13062-015-0083-4

Abstract

Background: The K-mer approach, treating genomic sequences as simple characters and counting the relative abundance of each string upon a fixed K, has been extensively applied to phylogeny inference for genome assembly, annotation, and comparison.

Results: To meet increasing demands for comparing large genome sequences and to promote the use of the K-mer approach, we develop a versatile database, KGCAK (http://kgcak.big.ac.cn/KGCAK/), containing ~8,000 genomes that include genome sequences of diverse life forms (viruses, prokaryotes, protists, animals, and plants) and cellular organelles of eukaryotic lineages. It builds phylogeny based on genomic elements in an alignment-free fashion and provides in-depth data processing enabling users to compare the complexity of genome sequences based on K-mer distribution.

Conclusion: We hope that KGCAK becomes a powerful tool for exploring relationships within and among groups of species in a tree of life based on genomic data.

Reviewers: This article was reviewed by Prof Mark Ragan and Dr Yuri Wolf.

Highlights

Over the past few decades, phylogenies have often been built from defined evolutionarily-conserved gene families and occasionally from sequences of whole genomes
The K-mer technique has been shown to be exceedingly effective in a variety of genomic applications, including genome assembly, motif discovery, repetitive sequence identification, and genome complexity assessment [2,3,4,5,6]
Genomes and gene annotations were acquired from Ensembl, Phytozome and NCBI genome databases

Summary

Introduction

Over the past few decades, phylogenies have often been built from defined evolutionarily-conserved gene families and occasionally from sequences of whole genomes. The K-mer technique has been shown to be exceedingly effective in a variety of genomic applications, including genome assembly, motif discovery, repetitive sequence identification, and genome complexity assessment [2,3,4,5,6]. With the rapid accumulation of large genomic datasets in diverse species, the need for an easy-to-use database that stores and visualizes processed K-mer based data is obvious, and [...] genomes into easy-to-understand and visualized data from a comparative genomics perspective. The K-mer approach, treating genomic sequences as simple characters and counting the relative abundance of each string upon a fixed K, has been extensively applied to phylogeny inference for genome assembly, annotation, and comparison.

Wang et al. Biology Direct (2015) 10:53
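To make the K-mer idea concrete, the following is a minimal sketch of counting the relative abundance of each K-mer in a sequence for a fixed K, and of comparing two such profiles without alignment. It is not the KGCAK pipeline: the function names `kmer_frequencies` and `euclidean_distance`, the choice to skip windows containing ambiguous bases, and the use of Euclidean distance are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

VALID_BASES = set("ACGT")  # assumption: windows with other symbols (e.g. N) are skipped


def kmer_frequencies(sequence, k):
    """Return the relative abundance of each K-mer over all overlapping windows."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Illustrative alignment-free distance between two K-mer profiles.

    Assumption: Euclidean distance is used only as a simple example;
    the source does not state which distance measure KGCAK uses.
    """
    keys = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) ** 2 for x in keys))


if __name__ == "__main__":
    # Hypothetical toy sequences, not real genomes.
    profile_a = kmer_frequencies("ACGTACGTAC", 2)
    profile_b = kmer_frequencies("ACGTTTTTAC", 2)
    print(profile_a)
    print(euclidean_distance(profile_a, profile_b))
```

A matrix of such pairwise distances could, in principle, be passed to a standard distance-based tree-building method. For real genomes, a larger K and a more memory-efficient counter would likely be needed.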

Objectives

Methods

Results