Abstract

Most single-nucleotide polymorphisms (SNPs) are located in non-coding regions, but the fraction most often studied lies in protein-coding regions, because potential impacts on proteins are relatively easy to predict with popular tools such as the Variant Effect Predictor. These tools annotate variants independently, without considering the potential effect of grouped or haplotypic variations, often called “multi-nucleotide variants” (MNVs). Here, we surveyed MNVs in a large RNA-seq dataset comprising 382 chicken samples from 11 populations, analyzed in the companion paper, in which 9.5M SNPs (including 3.3M SNPs with reliable genotypes) were detected. We focused our study on in-codon MNVs and evaluated their potential mis-annotation. Using GATK HaplotypeCaller read-based phasing results, we identified 2,965 MNVs, each observed in at least five individuals, located in 1,792 genes. We found that 41.1% of them showed a novel impact compared with the effects of their constituent SNPs analyzed separately. The largest shift in annotation concerned consequences originally annotated as stop-gained, of which around 95% were rescued, followed by missense consequences, of which 37% were reannotated with a different amino acid. We then present the rescued stop-gained MNVs in more depth and illustrate them with an example in the SLC27A4 gene. As previously shown in human datasets, our results in chicken demonstrate the value of haplotype-aware variant annotation and of considering MNVs in coding regions, particularly when searching for severe functional consequences such as stop-gained variants.

Highlights

  • Next-generation sequencing has given access to genomes at the nucleotide level through DNA-seq and to expressed regions by whole-exome sequencing (WES, originally focusing on exonic parts of the genome) or RNA-seq

  • Using 3.3M single-nucleotide polymorphisms (SNPs) previously detected from 767 multi-tissue RNA-seq of 382 animals from 11 chicken populations and enriched in coding regions [see the companion paper (Jehl et al, 2021), section “Materials and Methods”], we identified 260,919 unique SNPs in 26,702 transcripts corresponding to 15,835 genes out of 19,545 protein-coding genes (Figure 2, right part, in yellow)

  • We found 11,183 SNPs (4.3% of the SNPs in codons) as constituent variants of 5,533 “multi-nucleotide variants” (MNVs), which corresponded to 4,415 transcripts and 2,916 genes


Summary

Introduction

Next-generation sequencing has given access to genomes at the nucleotide level through DNA-seq and to expressed regions by whole-exome sequencing (WES, originally focusing on exonic parts of the genome) or RNA-seq. These data enable us to call genetic variations by spotting differences between aligned reads and the species reference genome or among aligned reads. Popular tools such as the Variant Effect Predictor (VEP) (McLaren et al, 2016), SnpEff (Cingolani et al, 2012), and ANNOtate VARiation (ANNOVAR) (Wang et al, 2010) have been developed over the last decade to predict the effects of SNPs on proteins. These tools consider each variation location individually, as if they were specific to “reference” nucleotides. MNV identification tools have been developed using different methods for phasing SNPs [MAC (Wei et al, 2015), varDic (Lai et al, 2016), COPE (Cheng et al, 2017), BCFtools (Danecek and McCarthy, 2017), and MACARON (Khan et al, 2018)] and have been applied to different human genetic variant datasets [1000 Genomes Project dataset (Cheng et al, 2017; Danecek and McCarthy, 2017; Khan et al, 2018; Wang et al, 2020), ExAC (Lek et al, 2016), The Cancer Genome Atlas (Lai et al, 2016), or gnomAD consortium (Wang et al, 2020)], mainly based on exomes.
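
To illustrate why annotating each variant individually can be misleading, the following minimal Python sketch compares per-SNP and joint (haplotype-aware) annotation of two SNPs falling in the same codon. It is not the pipeline used in the study: the function names (`consequence`, `apply_snps`, `annotate`) and the example codon are hypothetical, the codon is assumed to be read on the forward strand, and both SNPs are assumed to be phased on the same haplotype.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def consequence(ref_codon, alt_codon):
    """Return a simplified consequence label for a codon change."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return f"missense ({ref_aa}>{alt_aa})"


def apply_snps(codon, snps):
    """Apply (position_in_codon, alt_base) substitutions to a codon."""
    bases = list(codon)
    for pos, alt in snps:
        bases[pos] = alt
    return "".join(bases)


def annotate(ref_codon, snps):
    """Annotate each SNP separately, then all SNPs jointly as one MNV."""
    separate = [consequence(ref_codon, apply_snps(ref_codon, [s])) for s in snps]
    joint = consequence(ref_codon, apply_snps(ref_codon, snps))
    return separate, joint


# Hypothetical example: reference codon CAA (Gln) carrying C>T at
# position 0 and A>C at position 2 on the same haplotype.
separate, joint = annotate("CAA", [(0, "T"), (2, "C")])
print(separate)  # per-SNP view: TAA is a stop codon, CAC encodes His
print(joint)     # joint view: TAC encodes Tyr, so the stop is "rescued"
```

Taken alone, the first SNP is labeled stop-gained, whereas the haplotype actually carried by the individual encodes an amino acid, so the joint consequence is a missense change. This is the kind of rescued stop-gained case described in the abstract.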

