Abstract

We have developed FastGT, a computational method that counts the frequencies of unique k-mers in FASTQ-formatted genome data and uses this information to infer the genotypes of known variants. FastGT can detect the variants in a 30x genome in less than 1 hour using ordinary low-cost server hardware. The overall concordance with the genotypes of two Illumina “Platinum” genomes is 99.96%, and the concordance with the genotypes obtained with the Illumina HumanOmniExpress microarray is 99.82%. Our method provides a k-mer database that can be used for the simultaneous genotyping of approximately 30 million single nucleotide variants (SNVs), including >23,000 SNVs from the Y chromosome. The source code of the FastGT software is available at GitHub (https://github.com/bioinfo-ut/GenomeTester4/).

Highlights

  • Next-generation sequencing (NGS) technologies are widely used for studying genome variation

  • Every bi-allelic single nucleotide variant (SNV) position in the genome is covered by k k-mer pairs, where each pair is formed by the two k-mers that carry the alternative alleles

  • FastGT relies on the assumption that at least some of these k-mer pairs are unique and appear exclusively at this location in the genome; the occurrence counts of these unique k-mer pairs in sequencing data can be used to identify the genotype of this variant in a specific individual
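
As a rough illustration of the last two highlights, the sketch below enumerates the k k-mer pairs that overlap one bi-allelic SNV and tallies how often the k-mers of each allele occur in a set of reads. It is a minimal Python sketch under stated assumptions, not FastGT code: the function names `kmer_pairs` and `count_alleles`, the flank arguments, and the toy sequences are hypothetical, reads are assumed to lie on the same strand as the flanks (no reverse complements), and no check is made that a k-mer is unique in the genome.

```python
from collections import Counter


def kmer_pairs(left_flank, ref_allele, alt_allele, right_flank, k):
    """Return the k (ref_kmer, alt_kmer) pairs that overlap a bi-allelic SNV.

    Hypothetical helper: the flanks are the bases immediately before and
    after the variant position, each at least k - 1 bases long.
    """
    if len(left_flank) < k - 1 or len(right_flank) < k - 1:
        raise ValueError("each flank must hold at least k - 1 bases")
    pairs = []
    for offset in range(k):
        # offset = number of bases taken from the left flank, so the
        # variant sits at position `offset` inside the k-mer.
        left = left_flank[len(left_flank) - offset:]
        right = right_flank[:k - 1 - offset]
        pairs.append((left + ref_allele + right, left + alt_allele + right))
    return pairs


def count_alleles(reads, pairs):
    """Count read k-mers that match the reference or alternative k-mers."""
    k = len(pairs[0][0])
    ref_kmers = {ref for ref, _ in pairs}
    alt_kmers = {alt for _, alt in pairs}
    counts = Counter()
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in ref_kmers:
                counts["ref"] += 1
            elif kmer in alt_kmers:
                counts["alt"] += 1
    return counts["ref"], counts["alt"]
```

For example, calling `kmer_pairs("ACGTAC", "A", "G", "GTTCAG", 5)` yields five pairs, and passing one read per allele to `count_alleles` gives equal reference and alternative counts, which is the pattern expected for a heterozygous individual.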

Summary

Introduction

Next-generation sequencing (NGS) technologies are widely used for studying genome variation. One recent publication has described an alignment-free SNV calling method that is based on counting the frequency of k-mers [23]. This method converts the sequences from raw reads into a Burrows-Wheeler transform and calls genotypes by counting, using a variable-length unique substring surrounding the variant. The method only uses reliable regions of the genome and is approximately 1–2 orders of magnitude faster than traditional mapping-based genotype detection. FastGT is currently limited to the calling of previously known genomic variants because specific k-mers must be pre-selected for all known alleles. It is not a substitute for traditional mapping and variant calling but a complementary method that facilitates certain aspects of NGS-based genome analyses. Our method is based on three original components: (1) the procedure for the selection of unique k-mers, (2) the customized data structure for storing and counting k-mers directly from a FASTQ file, and (3) a maximum likelihood method designed for estimating genotypes from k-mer counts.
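
To make component (3) more concrete, the sketch below shows one simple way a genotype could be chosen by maximum likelihood from the k-mer counts of the two alleles. It is an illustrative stand-in, not the statistical model used by FastGT: it assumes a diploid site, a plain binomial model for the alternative-allele count, and a fixed error rate. The function name `call_genotype`, the genotype labels `AA`, `AB`, and `BB`, and the default `error_rate` are all hypothetical.

```python
import math


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return (best_genotype, log_likelihoods) under a simple binomial model.

    Assumption (not from FastGT): each observed k-mer carries the
    alternative allele with probability error_rate for AA, 0.5 for AB,
    and 1 - error_rate for BB.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    if ref_count + alt_count == 0:
        return None, {}  # no informative k-mers were observed
    alt_fraction = {"AA": error_rate, "AB": 0.5, "BB": 1.0 - error_rate}
    log_lik = {}
    for genotype, p in alt_fraction.items():
        # The binomial coefficient is identical for every genotype,
        # so it is omitted; it does not change which one is largest.
        log_lik[genotype] = (alt_count * math.log(p)
                             + ref_count * math.log(1.0 - p))
    best = max(log_lik, key=log_lik.get)
    return best, log_lik
```

With this model, counts dominated by one allele favour the corresponding homozygous genotype, while roughly balanced counts favour `AB`. A production caller would also need to account for sequencing depth and for sex chromosomes, which this sketch ignores.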
