Abstract

While variant identification pipelines are becoming increasingly standardized, less attention has been paid to the pre-processing of variants prior to their use in bacterial genome-wide association studies (bGWAS). Three nuances of variant pre-processing that impact downstream identification of genetic associations are the separation of variants at multiallelic sites, the separation of variants in overlapping genes, and the referencing of variants relative to ancestral alleles. Here we demonstrate the importance of these variant pre-processing steps on diverse bacterial genomic datasets and present prewas, an R package that standardizes the pre-processing of multiallelic sites, overlapping genes, and reference alleles before bGWAS. This package facilitates improved reproducibility and interpretability of bGWAS results. prewas enables users to extract maximal information from bGWAS by implementing multi-line representation for multiallelic sites and variants in overlapping genes. prewas outputs a binary SNP matrix that can be used for SNP-based bGWAS and that prevents the masking of minor alleles during bGWAS analysis. The optional binary gene matrix output can be used for gene-based bGWAS, which enables users to maximize the power and evolutionary interpretability of their bGWAS. prewas is available for download from GitHub.

Highlights

  • Bacterial genome-wide association studies (bGWAS) are frequently used to identify genetic variants associated with variation in microbial phenotypes such as antibiotic resistance, host specificity and virulence [1, 2, 3, 4]. bGWAS methods can be classified into two general categories: those that use k-length nucleotide sequences as features (e.g. [3, 5, 6, 7]), and those that use defined variant classes such as SNPs, gene presence/absence, or insertions/deletions as features (e.g. [4, 8, 9, 10, 11, 12]).

  • To maximize the potential for identifying genetic variation associated with a given phenotype using bGWAS, care must be taken in the pre-processing stage.

  • A multiallelic locus is a site in the genome with more than two alleles present and encompasses both triallelic and quadallelic sites. bGWAS typically requires a binary input for each genotype (e.g. [3, 4]), and multiallelic sites are, by definition, not binary.
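
The mismatch between a multiallelic site and a binary genotype input can be made concrete with a small sketch. The Python below is illustrative only and is not the prewas API (prewas is an R package); the function name `split_multiallelic`, the site label `pos100`, and the toy allele calls are all hypothetical. It assumes a reference allele has already been chosen and shows how one row per alternate allele keeps a rare allele visible, whereas a single "any non-reference" row hides it.

```python
def split_multiallelic(site_id, alleles, ref):
    """Return one binary row per alternate allele observed at a site.

    alleles: one allele call per sample (hypothetical toy input).
    ref: the allele treated as the reference (assumed chosen upstream).
    """
    alts = sorted(set(alleles) - {ref})
    rows = {}
    for alt in alts:
        # 1 marks samples carrying this specific alternate allele.
        rows[f"{site_id}_{ref}>{alt}"] = [1 if a == alt else 0 for a in alleles]
    return rows


# Toy triallelic site: A is the reference, G and T are alternates.
alleles = ["A", "A", "G", "T", "A", "G"]

# Multi-line representation: one binary row per alternate allele.
rows = split_multiallelic("pos100", alleles, ref="A")

# Single-line alternative: every non-reference allele is coded as 1,
# so the G and T carriers can no longer be told apart.
collapsed = [0 if a == "A" else 1 for a in alleles]

for name, row in rows.items():
    print(name, row)
print("collapsed", collapsed)
```

In this toy case the single sample carrying T gets its own row in the multi-line form, which is the behaviour the text describes as preventing the masking of minor alleles.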

Summary

Introduction

Bacterial genome-wide association studies (bGWAS) are frequently used to identify genetic variants associated with variation in microbial phenotypes such as antibiotic resistance, host specificity and virulence [1, 2, 3, 4]. bGWAS methods can be classified into two general categories: those that use k-length nucleotide sequences (kmers) as features (e.g. [3, 5, 6, 7]), and those that use defined variant classes such as SNPs, gene presence/absence, or insertions/deletions (indels) as features (e.g. [4, 8, 9, 10, 11, 12]). To determine the importance of variant pre-processing methods for bGWAS, we investigated the prevalence of multiallelic sites, mismatches in reference allele choice, and SNPs in overlapping genes in nine bacterial datasets. Our analysis indicates that multiallelic sites are common in large, diverse bacterial datasets, that different reference allele choices frequently disagree, and that SNPs in overlapping genes often have discordant functional impacts. We discuss the benefits and drawbacks of various variant pre-processing decisions and present the R package prewas to standardize SNP pre-processing, to incorporate multiallelic sites, and to prepare the data for gene-based analyses. We demonstrate the importance of these considerations by highlighting the prevalence of multiallelic sites and SNPs in overlapping genes within diverse bacterial genomes and the impact of reference allele choice on gene-based analyses. The output of prewas can be passed directly to bGWAS tools that require a binary matrix as input (e.g. [3, 4]). prewas can be downloaded from GitHub.
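
As a rough illustration of how SNPs in overlapping genes could be prepared for a gene-based analysis, the sketch below duplicates each SNP row once per gene it falls in and then collapses the rows into a binary gene matrix. It is written in Python for illustration and is not the prewas API; the helper names, the SNP and gene labels, and the toy data are hypothetical. The aggregation rule used here (a gene is coded 1 in a sample if any of its SNP rows is 1) is an assumption made for this example, not a rule stated in the text.

```python
def split_by_gene(snp_rows, snp_genes):
    """Give each SNP one row per gene it overlaps (multi-line representation)."""
    out = {}
    for snp, row in snp_rows.items():
        for gene in snp_genes[snp]:
            out[f"{snp}|{gene}"] = (gene, row)
    return out


def gene_matrix(per_gene_rows, n_samples):
    """Collapse per-gene SNP rows into one binary row per gene.

    Assumption for this sketch: a gene is 1 in a sample when any SNP
    assigned to that gene is 1 in that sample (element-wise OR).
    """
    genes = {}
    for gene, row in per_gene_rows.values():
        current = genes.setdefault(gene, [0] * n_samples)
        genes[gene] = [a | b for a, b in zip(current, row)]
    return genes


# Toy binary SNP matrix: three SNPs across four samples.
snp_rows = {
    "snp1": [1, 0, 0, 0],
    "snp2": [0, 1, 0, 0],
    "snp3": [0, 0, 0, 1],
}
# snp2 lies in the region where geneA and geneB overlap.
snp_genes = {
    "snp1": ["geneA"],
    "snp2": ["geneA", "geneB"],
    "snp3": ["geneB"],
}

per_gene = split_by_gene(snp_rows, snp_genes)
genes = gene_matrix(per_gene, n_samples=4)

for gene, row in genes.items():
    print(gene, row)
```

Because snp2 is represented once for each gene, it contributes to both geneA and geneB here. A per-gene row also leaves room to attach a separate functional annotation to the same SNP in each gene, which matters when the predicted impacts are discordant.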

