Personalized and graph genomes reveal missing signal in epigenomic data

Cristian Groza, Nicole Soranzo, Tony Kwan, Tomi Pastinen, Guillaume Bourque

doi:10.1186/s13059-020-02038-8

Abstract

Background: Epigenomic studies that use next generation sequencing experiments typically rely on the alignment of reads to a reference sequence. However, because of genetic diversity and the diploid nature of the human genome, we hypothesize that using a generic reference could lead to incorrectly mapped reads and bias downstream results.

Results: We show that accounting for genetic variation using a modified reference genome or a de novo assembled genome can alter histone H3K4me1 and H3K27ac ChIP-seq peak calls either by creating new personal peaks or by the loss of reference peaks. Using permissive cutoffs, modified reference genomes are found to alter approximately 1% of peak calls while de novo assembled genomes alter up to 5% of peaks. We also show statistically significant differences in the amount of reads observed in regions associated with the new, altered, and unchanged peaks. We report that short insertions and deletions (indels), followed by single nucleotide variants (SNVs), have the highest probability of modifying peak calls. We show that using a graph personalized genome represents a reasonable compromise between modified reference genomes and de novo assembled genomes. We demonstrate that altered peaks have a genomic distribution typical of other peaks.

Conclusions: Analyzing epigenomic datasets with personalized and graph genomes allows the recovery of new peaks enriched for indels and SNVs. These altered peaks are more likely to differ between individuals and, as such, could be relevant in the study of various human phenotypes.

Highlights

Epigenomic studies that use next generation sequencing experiments typically rely on the alignment of reads to a reference sequence
We wanted to estimate the proportion of changed mappings and noted that 3.6% of whole-genome sequencing (WGS) reads move depending on the reference that is used (Additional file 1: Table S1a)
Personal-only peaks emerge when reads shift their mapping from the reference pileup to the new personalized pileup or when reads that did not map to the reference become mapped to the personalized genome

Summary

Introduction

Epigenomic studies that use next generation sequencing experiments typically rely on the alignment of reads to a reference sequence. Because of genetic diversity and the diploid nature of the human genome, we hypothesize that using a generic reference could lead to incorrectly mapped reads and bias downstream results. Standard ChIP-seq analysis relies on aligning reads to a reference sequence followed by peak calling [1, 2]. Differences between the genome under study and the reference will shift the mapping of some reads and leave others unmapped (Fig. 1a), a phenomenon known as reference bias [5]. It has already been shown that just changing the assembly version of the reference can affect epigenomic analyses [6].
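
To make the idea of shifted and unmapped reads concrete, the following is a minimal sketch of how one might tally reads whose mapping changes between two alignments. It is not the authors' pipeline. The function names, the category labels, and the input format (a dictionary from read ID to a `(chromosome, position)` pair, or `None` for an unmapped read) are all assumptions made for illustration, and the sketch further assumes that positions on the personalized genome have already been converted to reference coordinates.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Hypothetical representation: (chromosome, leftmost position), or None if unmapped.
Locus = Optional[Tuple[str, int]]


def classify_reads(ref: Dict[str, Locus], personal: Dict[str, Locus]) -> Counter:
    """Count reads by how their mapping differs between two alignments.

    Assumes both alignments are expressed in the same coordinate system.
    """
    counts: Counter = Counter()
    for read_id in ref.keys() | personal.keys():
        r = ref.get(read_id)        # None if unmapped or absent
        p = personal.get(read_id)
        if r is None and p is None:
            counts["unmapped_in_both"] += 1
        elif r is None:
            counts["gained"] += 1   # mapped only on the personalized genome
        elif p is None:
            counts["lost"] += 1     # mapped only on the reference
        elif r == p:
            counts["unchanged"] += 1
        else:
            counts["moved"] += 1    # mapped in both, at different loci
    return counts


def fraction_changed(counts: Counter) -> float:
    """Fraction of all reads that moved, were gained, or were lost."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["moved"] + counts["gained"] + counts["lost"]) / total


# Toy data, invented for illustration only.
ref_hits = {"r1": ("chr1", 100), "r2": ("chr1", 200), "r3": None, "r4": ("chr2", 50)}
personal_hits = {"r1": ("chr1", 100), "r2": ("chr1", 250), "r3": ("chr3", 10), "r4": None}

toy_counts = classify_reads(ref_hits, personal_hits)
toy_fraction = fraction_changed(toy_counts)
```

In a real analysis the two dictionaries would be filled from alignment files, and the "moved" and "gained" reads are the ones that could contribute to personal-only peaks.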

Journal: Genome Biology
Publication Date: May 25, 2020
Citations: 33
License type: open-access
