Genotype-Corrector: improved genotype calls for genetic mapping in F2 and RIL populations

James C Schnable,Jingping Fang,Jinliang Yang,Delin Li,Xingtan Zhang,Chenyong Miao,Haibao Tang,Pingping Liang

doi:10.1038/s41598-018-28294-0

Abstract

F2 and recombinant inbred line (RIL) populations are very commonly used in plant genetic mapping studies. Although genome-wide genetic markers such as single nucleotide polymorphisms (SNPs) can be readily identified by a wide array of methods, accurate genotype calling remains challenging, especially for heterozygous loci and for data missing because of low sequencing coverage per individual. We therefore developed Genotype-Corrector, a program that corrects genotype calls and imputes missing data to improve the accuracy of genetic mapping. Genotype-Corrector can be applied in a wide variety of genetic mapping studies that are based on low-coverage whole genome sequencing (WGS) or Genotyping-by-Sequencing (GBS) related techniques. Our results show that Genotype-Corrector achieves high accuracy when applied to both synthetic and real genotype data. Compared with linkage groups built from raw or only imputed genotype calls, those built from corrected genotype data show much less noise, and significant distortions are corrected. Additionally, Genotype-Corrector compares favorably to the popular imputation programs LinkImpute and Beagle in both F2 and RIL populations. Genotype-Corrector is publicly available on GitHub at https://github.com/freemao/Genotype-Corrector.

Highlights

With the availability of high-throughput sequencing (HTS) technology, it is straightforward to identify and score large numbers of single nucleotide polymorphism (SNP) variants segregating in mapping populations.
Our results showed that Genotype-Corrector can improve the accuracy of segregation datasets in F2 and recombinant inbred line (RIL) populations genotyped by mainstream genotyping methods.
We evaluated the accuracy of Genotype-Corrector using a simulated F2 population and a Medicago truncatula RIL population.

Summary

Introduction

With the availability of high-throughput sequencing (HTS) technology, it is straightforward to identify and score large numbers of SNP variants segregating in mapping populations. For methods based on WGS or targeted sequencing, large genome size and sub-optimal amounts of sequencing data generated per sample (as a result of limited budgets) can produce a relatively low depth of coverage at certain loci. Such low sequencing coverage often leads to inaccurate genotype calls, especially in heterozygous regions, which require deeper coverage to identify both alleles[8]. LinkImpute[17] and FILLIN[18] are optimized for low-coverage sequencing data in plants[19,20]. These programs use robust statistical methods to address the missing data problem in diverse populations and can be adapted to more structured populations such as F2 or RIL. The linkage groups constructed from our corrected genotype data are much cleaner than the original linkage groups, and significant distortions were corrected as a result of running the software.
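The point that heterozygous regions require deeper coverage can be illustrated with a small calculation. The sketch below is not part of Genotype-Corrector. It assumes an idealized model in which each read at a heterozygous site samples either allele with probability 0.5 and sequencing errors are ignored, and the function name prob_het_missed is hypothetical. Under those assumptions, the chance that all reads at a site carry the same allele, so that the site looks homozygous, is 2 × 0.5^depth.

```python
def prob_het_missed(depth: int) -> float:
    """Probability that every read at a heterozygous site shows the same allele.

    Assumes (for illustration only) unbiased allele sampling and no
    sequencing errors; real data will deviate from this simple model.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Either all reads show allele A or all show allele B: 2 * 0.5**depth.
    return 2 * 0.5 ** depth


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        p = prob_het_missed(depth)
        print(f"depth={depth}: P(het appears homozygous) = {p:.4f}")
```

Under this simplified model, a heterozygous site covered by two reads would be seen as homozygous half of the time, which is consistent with the observation that low-coverage data tend to undercall heterozygotes.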

Methods

Results

Conclusion