Abstract

Background

The current bovine genomic reference sequence was assembled from a Hereford cow. The resulting linear assembly lacks diversity because it does not contain allelic variation, a drawback of linear references that causes reference allele bias. High nucleotide diversity and the separation of individuals into hundreds of breeds make cattle ideally suited to investigate the optimal composition of variation-aware references.

Results

We augment the bovine linear reference sequence (ARS-UCD1.2) with variants filtered for allele frequency in dairy (Brown Swiss, Holstein) and dual-purpose (Fleckvieh, Original Braunvieh) cattle breeds to construct either breed-specific or pan-genome reference graphs using the vg toolkit. We find that read mapping is more accurate against variation-aware references than against linear references if pre-selected variants are used to construct the genome graphs. Graphs that contain random variants do not improve read mapping over the linear reference sequence. Breed-specific augmented graphs and pan-genome graphs yield similar improvements in mapping accuracy over the linear reference. We construct a whole-genome graph that contains the Hereford-based reference sequence and 14 million alleles that have an alternate allele frequency greater than 0.03 in the Brown Swiss cattle breed. Our novel variation-aware reference facilitates accurate read mapping and unbiased sequence variant genotyping for SNPs and indels.

Conclusions

We develop the first variation-aware reference graph for an agricultural animal (https://doi.org/10.5281/zenodo.3759712). Our novel reference structure improves sequence read mapping and variant genotyping over the linear reference. Our work is a first step towards the transition from linear to variation-aware reference structures in species with high genetic diversity and many sub-populations.

Highlights

  • A reference sequence is an assembly of digital nucleotides that are representative of a species’ genetic constitution

  • Our novel variation-aware reference facilitates accurate read mapping and unbiased sequence variant genotyping for single-nucleotide polymorphisms (SNPs) and insertions and deletions (indels)

  • We show that breed-specific augmented and pan-genome graphs allow for significant read mapping accuracy improvements over linear reference sequences


Summary

Introduction

A reference sequence is an assembly of digital nucleotides that are representative of a species’ genetic constitution. Discovery and genotyping of polymorphic sites from whole-genome sequencing data typically involve reference-guided alignment and genotyping steps that are carried out successively [1]. Variants are discovered at positions where aligned sequencing reads differ from the corresponding reference nucleotides. Long-read sequencing and sophisticated genome assembly methods have enabled spectacular improvements in the quality of linear reference sequences for species with gigabase-sized genomes [2]. The resulting de novo assemblies exceed all current reference sequences in quality and continuity [3, 4]. The current bovine genomic reference sequence was assembled from a Hereford cow. High nucleotide diversity and the separation of individuals into hundreds of breeds make cattle ideally suited to investigate the optimal composition of variation-aware references.
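The abstract describes augmenting the linear reference with variants pre-selected by alternate allele frequency (greater than 0.03 for the Brown Swiss whole-genome graph). The sketch below illustrates only that selection step on in-memory records. The `Variant` class, its field names, and the `select_variants` helper are hypothetical and are not taken from the paper or from the vg toolkit. It also assumes that frequency is computed from allele counts in diploid genotypes, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not a vg or VCF library type)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_count: int      # alternate alleles observed in the breed
    total_alleles: int  # called alleles, i.e. 2 x genotyped diploid animals


def alt_allele_frequency(variant: Variant) -> float:
    """Return the alternate allele frequency, or 0.0 if nothing was called."""
    if variant.total_alleles == 0:
        return 0.0
    return variant.alt_count / variant.total_alleles


def select_variants(variants: Iterable[Variant], min_af: float = 0.03) -> List[Variant]:
    """Keep variants whose alternate allele frequency is strictly above min_af.

    The 0.03 default mirrors the threshold quoted in the abstract; the
    strict comparison follows the wording "greater than 0.03".
    """
    return [v for v in variants if alt_allele_frequency(v) > min_af]


# Toy records with made-up coordinates and counts, for illustration only.
candidates = [
    Variant("1", 1000, "A", "G", alt_count=50, total_alleles=100),  # AF 0.50
    Variant("1", 2000, "C", "T", alt_count=1, total_alleles=100),   # AF 0.01
    Variant("1", 3000, "G", "GA", alt_count=4, total_alleles=100),  # AF 0.04
]
selected = select_variants(candidates)
```

Only the records that pass the filter would then be handed to graph construction. A randomly chosen variant set of the same size would skip this step, which is the comparison the abstract reports as giving no improvement over the linear reference.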
