FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads

Fanny-Dhelia Pajuste, Lauris Kaplinski, Märt Möls, Maido Remm, Tarmo Puurand, Maarja Lepamets

doi:10.1038/s41598-017-02487-5

Open Access

Journal: Scientific Reports
Publication Date: May 31, 2017
Citations: 40
License type: open-access

Affiliation: University of Tartu

Abstract

We have developed a computational method that counts the frequencies of unique k-mers in FASTQ-formatted genome data and uses this information to infer the genotypes of known variants. FastGT can detect the variants in a 30x genome in less than 1 hour using ordinary low-cost server hardware. The overall concordance with the genotypes of two Illumina “Platinum” genomes is 99.96%, and the concordance with the genotypes of the Illumina HumanOmniExpress is 99.82%. Our method provides a k-mer database that can be used for the simultaneous genotyping of approximately 30 million single nucleotide variants (SNVs), including >23,000 SNVs from the Y chromosome. The source code of the FastGT software is available on GitHub (https://github.com/bioinfo-ut/GenomeTester4/).

Highlights

Next-generation sequencing (NGS) technologies are widely used for studying genome variation
Every bi-allelic single nucleotide variant (SNV) position in the genome is covered by k k-mer pairs, where each pair is formed by the k-mers corresponding to the two alternative alleles
FastGT relies on the assumption that at least some of these k-mer pairs are unique and appear exclusively at this location in the genome; the occurrence counts of these unique k-mer pairs in sequencing data can be used to identify the genotype of this variant in a specific individual
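
The k-mer pair idea in the highlights above can be illustrated with a minimal Python sketch. It is not the FastGT implementation: the function names `kmer_pairs` and `unique_pairs`, the toy genome, and the uniqueness rule (reference k-mer seen exactly once, alternative k-mer never seen) are assumptions made for illustration, and reverse-complement strands are ignored.

```python
from collections import Counter


def kmer_pairs(ref_seq, pos, alt_allele, k):
    """Return (ref_kmer, alt_kmer) for every k-mer window covering pos (0-based)."""
    # Build the alternative-allele version of the sequence.
    alt_seq = ref_seq[:pos] + alt_allele + ref_seq[pos + 1:]
    pairs = []
    for start in range(pos - k + 1, pos + 1):
        # Skip windows that would run off either end of the sequence.
        if start < 0 or start + k > len(ref_seq):
            continue
        pairs.append((ref_seq[start:start + k], alt_seq[start:start + k]))
    return pairs


def unique_pairs(pairs, genome, k):
    """Keep pairs whose k-mers do not occur anywhere else in the genome.

    Assumed rule (hypothetical): the reference k-mer occurs exactly once
    and the alternative k-mer does not occur at all.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return [(r, a) for r, a in pairs if counts[r] == 1 and counts[a] == 0]


# Toy example: a G>T variant at index 5 of a made-up 11-base "genome".
genome = "AACCGGTTAGC"
pairs = kmer_pairs(genome, pos=5, alt_allele="T", k=3)
usable = unique_pairs(pairs, genome, k=3)
```

With k = 3 the variant is covered by three pairs. In this toy genome the middle pair is discarded because its alternative k-mer `GTT` already occurs elsewhere in the sequence, which is the kind of ambiguity that a selection of unique k-mers is meant to avoid.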

Summary

Introduction

Next-generation sequencing (NGS) technologies are widely used for studying genome variation. One recent publication has described an alignment-free SNV calling method that is based on counting the frequency of k-mers[23]. This method converts sequences from raw reads into a Burrows-Wheeler transform and calls genotypes by counting occurrences of a variable-length unique substring surrounding the variant. The method only uses reliable regions of the genome and is approximately 1–2 orders of magnitude faster than traditional mapping-based genotype detection. FastGT is currently limited to the calling of previously known genomic variants because specific k-mers must be pre-selected for all known alleles. It is not a substitute for traditional mapping and variant calling but a complementary method that facilitates certain aspects of NGS-based genome analyses. Our method is based on three original components: (1) the procedure for the selection of unique k-mers, (2) the customized data structure for storing and counting k-mers directly from a FASTQ file, and (3) a maximum likelihood method designed for estimating genotypes from k-mer counts.
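
Components (2) and (3) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the FastGT code: a plain `Counter` stands in for the customized data structure, the likelihood is an independent Poisson model chosen here for brevity, and the names `count_target_kmers`, `call_genotype`, `depth_per_copy`, and `error_rate` are hypothetical. Reverse-complement k-mers and base qualities are ignored.

```python
import math
from collections import Counter


def count_target_kmers(fastq_lines, targets, k):
    """Count occurrences of pre-selected k-mers in the reads of a FASTQ file."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        # In a 4-line FASTQ record the sequence is the second line.
        if i % 4 != 1:
            continue
        read = line.strip().upper()
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in targets:
                counts[kmer] += 1
    return counts


def poisson_logpmf(x, mean):
    """Log-probability of observing x events under a Poisson(mean) model."""
    return x * math.log(mean) - mean - math.lgamma(x + 1)


def call_genotype(ref_count, alt_count, depth_per_copy, error_rate=0.01):
    """Pick the genotype (AA, AB, or BB) with the highest likelihood.

    Assumption: each allele copy contributes depth_per_copy k-mer
    observations on average, and an absent allele is still seen at a
    small rate (error_rate * depth_per_copy) because of sequencing errors.
    """
    copies = {"AA": (2, 0), "AB": (1, 1), "BB": (0, 2)}
    floor = error_rate * depth_per_copy
    loglik = {}
    for genotype, (n_ref, n_alt) in copies.items():
        mean_ref = max(n_ref * depth_per_copy, floor)
        mean_alt = max(n_alt * depth_per_copy, floor)
        loglik[genotype] = (poisson_logpmf(ref_count, mean_ref)
                            + poisson_logpmf(alt_count, mean_alt))
    return max(loglik, key=loglik.get)


# Toy usage: two made-up reads and two target k-mers.
fastq = ["@r1", "ACGTA", "+", "IIIII", "@r2", "CGTAC", "+", "IIIII"]
counts = count_target_kmers(fastq, {"CGT", "GTA"}, k=3)
genotype = call_genotype(ref_count=14, alt_count=16, depth_per_copy=15.0)
```

Restricting the count to a pre-selected set of k-mers is what makes the approach independent of read alignment: each read is scanned once and only lookups against the target set are needed.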

Methods

Results

Conclusion