A cost-effective strategy to obtain ultra-dense genomic information is to sequence part of the population and impute the remaining animals from lower-density genotypes to sequence level. The aims of this study were to evaluate the feasibility of genotype imputation from medium density to sequence level in Nile tilapia and to investigate the effects of the size and origin of the reference population on imputation accuracy. Genomic DNA was extracted from fin-clip samples of 326 animals from three different populations (PA, PB and PC). After sequencing, alignment, variant calling and quality control of genotypes, approximately 4.6 million single-nucleotide polymorphisms (SNPs) common to all populations were retained and used for the imputation analyses. Four scenarios were evaluated to assess imputation accuracy in each population, including two reference sizes (10% or 90% of the animals of each reference population) and two reference origins (two different populations only, or all three populations used as reference). The animals in the validation set had their genotypes masked so that only 49,216 SNPs remained available, and imputation accuracy was assessed using the correlation between the imputed and observed genotypes (R2). Imputation was carried out using FImpute3 software. At the individual level, R2 showed intermediate values, ranging from 0.37 ± 0.04 to 0.56 ± 0.07 for PA, 0.43 ± 0.05 to 0.58 ± 0.08 for PB and 0.43 ± 0.05 to 0.58 ± 0.07 for PC. R2 increased when 90% of the animals from the same population were used as reference rather than only 10% (0.37 ± 0.04 to 0.54 ± 0.07 for PA, 0.43 ± 0.05 to 0.57 ± 0.07 for PB and 0.43 ± 0.05 to 0.58 ± 0.07 for PC). At the SNP level, using all three populations as reference yielded the best results in terms of the number of SNPs imputed with accuracy greater than 0.8. On average, 676,233 ± 142,291, 666,559 ± 52,648 and 592,187 ± 89,663 SNPs were imputed with accuracy greater than 0.8 for PA, PB and PC, respectively. Considering only these highly accurate imputed SNPs, the average imputation accuracy of samples was 0.95 ± 0.06 for PA and 0.92 ± 0.07 for PB and PC in the scenarios that included more animals as reference (90% of the same population, two populations and three populations). R2 did not differ significantly between the scenarios that used 90% of the animals from the same population and those that used animals from all three populations as reference, showing that enlarging the reference population with information from other populations had only a minor effect on imputation accuracy. In conclusion, it was feasible to impute from 50K to approximately 700K SNPs with high accuracy using tilapia sequence data. We also expect that using more animals from these populations, or animals from ascending lines, as reference could help the imputation process yield millions of imputed SNPs with high accuracy.
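The accuracy measures described above can be illustrated with a minimal sketch. It assumes that R2 denotes the squared Pearson correlation between observed and imputed genotypes coded as 0, 1 or 2 allele copies, computed per animal (across SNPs) and per SNP (across animals). The function names, the toy matrices and the handling of monomorphic SNPs are hypothetical choices made for illustration; they are not taken from the study, which performed the imputation itself with FImpute3.

```python
import numpy as np


def squared_correlation(observed, imputed):
    """Squared Pearson correlation between two genotype vectors.

    Returns nan when either vector has no variation, because the
    correlation is undefined in that case (an assumption of this sketch).
    """
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    if observed.std() == 0 or imputed.std() == 0:
        return float("nan")
    r = np.corrcoef(observed, imputed)[0, 1]
    return float(r ** 2)


def imputation_accuracy(observed, imputed, threshold=0.8):
    """Per-animal R2, per-SNP R2 and the count of SNPs above a threshold.

    Both inputs are (animals x SNPs) matrices of 0/1/2 genotype codes.
    """
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    # Individual level: correlate each animal's genotypes across SNPs.
    per_animal = np.array(
        [squared_correlation(o, i) for o, i in zip(observed, imputed)]
    )
    # SNP level: correlate each SNP's genotypes across animals.
    per_snp = np.array(
        [squared_correlation(o, i) for o, i in zip(observed.T, imputed.T)]
    )
    # Undefined (nan) accuracies compare as False and are not counted.
    n_accurate = int(np.sum(per_snp > threshold))
    return per_animal, per_snp, n_accurate


# Toy data: 4 animals x 3 SNPs, with a single imputation error
# (last animal, last SNP).
observed = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
imputed = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 1]]

per_animal, per_snp, n_accurate = imputation_accuracy(observed, imputed)
print("per-animal R2:", np.round(per_animal, 3))
print("per-SNP R2:", np.round(per_snp, 3))
print("SNPs with R2 > 0.8:", n_accurate)
```

Restricting the per-animal calculation to the SNPs that pass the threshold would correspond to the "highly accurate imputed SNPs" subset mentioned in the abstract, under the same assumptions.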