NgsJulia: population genetic analysis of next-generation DNA sequencing data with Julia language

Alex Mas-Sandoval, Chenyu Jin, Marco Fracassetti, Matteo Fumagalli

doi:10.12688/f1000research.104368.1

Abstract

A sound analysis of DNA sequencing data is important to extract meaningful information and infer quantities of interest. Sequencing and mapping errors, coupled with low and variable coverage, hamper the identification of genotypes and variants and the estimation of population genetic parameters. Currently available methods and implementations for estimating population genetic parameters from sequencing data are either suitable only for the analysis of genomes from model organisms, require moderate sequencing coverage, or are not easily adaptable to specific applications. To address these issues, we introduce ngsJulia, a collection of templates and functions in the Julia language to process short-read sequencing data for population genetic analysis. We further describe two implementations, ngsPool and ngsPloidy, for the analysis of pooled sequencing data and polyploid genomes, respectively. Through simulations, we illustrate the performance of these implementations in estimating various population genetic parameters with both established and novel statistical methods. These results inform optimal experimental design and demonstrate the applicability of the methods in ngsJulia to estimate parameters of interest even from low-coverage sequencing data. ngsJulia provides users with a flexible and efficient framework for ad hoc analysis of sequencing data. ngsJulia is available from: https://github.com/mfumagalli/ngsJulia

Highlights

Population genetics, i.e. the study of genetic variation within and between groups, plays a central role in evolutionary inferences.
ngsJulia implements data structures and functions for the easy calculation of nucleotide and genotype likelihoods, which serve as the basis for genotype and SNP calling and for the estimation of allele frequencies and other summary statistics.
To demonstrate the use of ngsJulia, we provide two custom applications built from its templates and functions.

Summary

Introduction

Population genetics, i.e. the study of genetic variation within and between groups, plays a central role in evolutionary inferences. The quantification of genetic diversity serves as the basis for the inference of neutral[1] and adaptive[2] events that characterised the history of different populations. The comparison of allele frequencies between groups (e.g. cases and controls) is an important aspect of biomedical and clinical sciences.[3] In the last 20 years, next-generation sequencing (NGS) technologies have allowed researchers to generate an unprecedented amount of genomic data for a wide range of organisms.[4] This revolution transformed population genetics (now labelled population genomics) into a data-driven discipline. Data produced by short-read sequencing machines (still the most accessible platform worldwide) consist of a collection of relatively short reads. All observed sequenced reads are used to infer the per-sample genotype (an operation called ‘genotype calling’) and the inter-sample variability, i.e. whether a particular site is polymorphic (an operation called ‘single-nucleotide polymorphism (SNP) calling’).[5]
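To make the idea of genotype calling concrete, the following is a minimal sketch in Python of how a diploid genotype could be inferred from the bases observed at a single site. It assumes a commonly used likelihood model in which each read base is drawn from one of the two alleles with equal probability and is miscalled with the probability implied by its Phred quality score. The function names, the example pileup, and the model itself are illustrative assumptions; this is not the ngsJulia API, and ngsJulia's own model may differ.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"  # the four nucleotides considered as possible alleles


def phred_to_error(q):
    """Convert a Phred quality score into a base-calling error probability."""
    return 10 ** (-q / 10)


def base_prob(observed, allele, err):
    """P(observed base | true allele): correct with 1 - err, otherwise one
    of the three other bases with probability err / 3 (assumed error model)."""
    return 1 - err if observed == allele else err / 3


def genotype_log_likelihoods(bases, quals):
    """Log-likelihood of each of the 10 unordered diploid genotypes,
    treating reads as independent given the genotype."""
    logliks = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        ll = 0.0
        for b, q in zip(bases, quals):
            err = phred_to_error(q)
            # Each read samples one of the two alleles with probability 1/2.
            ll += math.log(0.5 * base_prob(b, a1, err)
                           + 0.5 * base_prob(b, a2, err))
        logliks[a1 + a2] = ll
    return logliks


def call_genotype(bases, quals):
    """Genotype calling as the maximum-likelihood genotype (no prior)."""
    logliks = genotype_log_likelihoods(bases, quals)
    return max(logliks, key=logliks.get)


# Hypothetical pileup at one site: 5 reads show A, 3 show G, all at Q30.
pileup_bases = "AAAGAGGA"
pileup_quals = [30] * len(pileup_bases)
called = call_genotype(pileup_bases, pileup_quals)
```

In this sketch a site covered by a mix of A and G reads is called heterozygous, whereas a site covered only by A reads is called homozygous. With very few reads the likelihoods of competing genotypes become close, which is why low coverage makes genotype calling uncertain.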

Methods

Results

Conclusion