Abstract

How DNA sequence variation influences gene expression remains poorly understood. Diploid organisms have two homologous copies of their DNA sequence in the same nucleus, providing a rich source of information about how genetic variation affects a wealth of biochemical processes. However, few computational methods have been developed to discover allele specific differences in functional genomic data. Existing methods either treat each SNP independently, limiting statistical power, or combine SNPs across gene annotations, preventing the discovery of allele specific differences in unexpected genomic regions. Here we introduce AlleleHMM, a new computational method to identify blocks of neighboring SNPs that share similar allele specific differences in mark abundance. AlleleHMM uses a hidden Markov model to divide the genome into three hidden states based on allele frequencies in genomic data: a symmetric state (state S) which shows no difference between alleles, and regions with a higher signal on the maternal (state M) or paternal (state P) allele. AlleleHMM substantially outperformed naive methods using both simulated and real genomic data, particularly when input data had realistic levels of overdispersion. Using global run-on sequencing (GRO-seq) data, AlleleHMM identified thousands of allele specific blocks of transcription in both coding and non-coding genomic regions. AlleleHMM is a powerful tool for discovering allele specific regions in functional genomic datasets.
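
The three-state segmentation described above can be illustrated with a minimal sketch. This is not the AlleleHMM implementation: the binomial emission model, the expected maternal read fractions (0.5, 0.9, 0.1), the uniform start probabilities, and the `stay` self-transition probability are all assumptions chosen for illustration, and a plain binomial ignores the overdispersion that the abstract notes in real data. The function names `log_binom` and `viterbi` are hypothetical.

```python
import math

STATES = ("S", "M", "P")
# Assumed maternal-read fraction expected in each state (illustrative values only).
P_MATERNAL = {"S": 0.5, "M": 0.9, "P": 0.1}


def log_binom(k, n, p):
    """Log binomial probability of k maternal reads out of n total reads."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def viterbi(counts, stay=0.99):
    """Most likely state path for a list of (maternal, paternal) read counts.

    `stay` is an assumed probability of remaining in the same state
    between neighboring SNPs; the rest is split evenly among the others.
    """
    if not counts:
        return []
    switch = (1.0 - stay) / (len(STATES) - 1)
    log_trans = {(a, b): math.log(stay if a == b else switch)
                 for a in STATES for b in STATES}

    def emit(state, mat, pat):
        return log_binom(mat, mat + pat, P_MATERNAL[state])

    # Initialise with a uniform prior over the three states.
    mat, pat = counts[0]
    score = {s: math.log(1.0 / len(STATES)) + emit(s, mat, pat) for s in STATES}
    back = []

    # Forward pass: keep the best predecessor for every state at every SNP.
    for mat, pat in counts[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log_trans[(r, s)])
            ptr[s] = prev
            new_score[s] = score[prev] + log_trans[(prev, s)] + emit(s, mat, pat)
        score = new_score
        back.append(ptr)

    # Backward pass: trace the best path from the final SNP.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    # Five balanced SNPs, five maternal-biased SNPs, five balanced SNPs.
    snps = [(10, 10)] * 5 + [(18, 2)] * 5 + [(10, 10)] * 5
    print("".join(viterbi(snps)))
```

Because switching states carries a cost, a single mildly biased SNP surrounded by balanced neighbors stays in state S, whereas a run of consistently biased SNPs is called as one block. This is the intuition behind pooling evidence across neighboring SNPs rather than testing each SNP independently.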

Highlights

  • DNA encodes the blueprints for making an organism, in part by coordinating a complex cell-type and condition-specific gene expression program

  • We developed AlleleHMM to identify genomic regions that share allele specific differences in functional mark abundance

  • We show that AlleleHMM provides substantial improvements in both sensitivity and specificity for detecting allele specific single-nucleotide polymorphisms (SNPs) compared with existing computational tools, using both simulation studies and analyses of real global run-on sequencing (GRO-seq) data


Summary

Introduction

DNA encodes the blueprints for making an organism, in part by coordinating a complex cell-type and condition-specific gene expression program. How DNA or RNA sequences control each step during transcription, mRNA processing, and mRNA degradation remains poorly understood. Finding allele specific differences in the distribution of marks along the genome is a powerful strategy for understanding the link between DNA sequence and the various biochemical processes that regulate gene expression [3,4]. Diploid organisms have two copies of their DNA sequence in the same nuclear environment, providing a rich source of information about how genetic variation affects biochemical processes. The two alleles in a diploid genome are exposed to the same environmental signals, the same mix of cell types within a complex tissue, and the same other potential confounding factors. Allele specific signatures are therefore a rigorous source of information about how DNA sequence affects gene expression.
