What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants.

Dmitry D Penzar, Ilya E Vorontsov, Alexander V Favorov, Arsenii O Zinkevich, Vasily V Sitnik, Ivan V Kulakovskiy, Vsevolod J Makeev

doi:10.3389/fgene.2019.01078

Abstract

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent “Regulation Saturation” Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the “information leakage” caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
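The difference between the two kinds of validation scenario described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the region names are taken from the challenge description, but the positions are synthetic, the function names (`random_split`, `leave_region_out_split`, `leaky_fraction`) are hypothetical, and the 5 bp window is an arbitrary assumption chosen only to make "located next to" concrete.

```python
import random
from collections import defaultdict

# Region names come from the challenge description; positions are synthetic.
REGIONS = ["IRF4", "MYC", "SORT1", "TERT"]


def make_variants(regions, length=50):
    """Return (region, position) pairs: one synthetic variant per position."""
    return [(region, pos) for region in regions for pos in range(length)]


def random_split(variants, val_fraction=0.2, seed=0):
    """Shuffle all variants together, ignoring which region they come from."""
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


def leave_region_out_split(variants, held_out):
    """Keep every variant of one region out of the training set."""
    train = [v for v in variants if v[0] != held_out]
    val = [v for v in variants if v[0] == held_out]
    return train, val


def leaky_fraction(train, val, window=5):
    """Fraction of validation variants that have a training variant
    from the same region within `window` positions (assumed threshold)."""
    train_positions = defaultdict(set)
    for region, pos in train:
        train_positions[region].add(pos)
    leaky = 0
    for region, pos in val:
        if any(abs(pos - p) <= window for p in train_positions[region]):
            leaky += 1
    return leaky / len(val)


if __name__ == "__main__":
    variants = make_variants(REGIONS)
    train, val = random_split(variants)
    print("random split:", leaky_fraction(train, val))
    train, val = leave_region_out_split(variants, held_out="TERT")
    print("region held out:", leaky_fraction(train, val))
```

Under the random split, nearly every validation variant has close neighbors in the training set, which is the situation in which local context can leak into the quality estimate. Holding out a whole region removes those neighbors by construction, at the cost of leaving fewer independent regions to evaluate on.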

Highlights

Recent progress in medical genetics has drawn attention to sequence variants in the regulatory regions that can alter transcription factor binding (Deplancke et al, 2016), affect cell identity (Liu et al, 2017), and bring about disorders like cancer (Killela et al, 2013) and schizophrenia (Fabbri and Serretti, 2017).
The CAGI “Regulation Saturation” challenge data (Shigaki et al, 2019) included expression changes observed for more than 17 thousand induced single-nucleotide variants (SNVs) within regulatory regions, measured using reporters constructed from 5 human enhancers (IRF4, IRF6, MYC, SORT1, ZFAND3) and 9 promoters (F9, GP1BB, HBB, HBG, HNF4A, LDLR, MSMB, PKLR, TERT), each tested in a particular cell type (TERT was tested in two cell types).
Massively parallel reporter assays (MPRA) and machine learning are two recent technologies whose power is yet to be harnessed for the progress of genetic studies in regulatory genomics.

Summary

Introduction

Recent progress in medical genetics has drawn attention to sequence variants in the regulatory regions that can alter transcription factor binding (Deplancke et al, 2016), affect cell identity (Liu et al, 2017), and bring about disorders like cancer (Killela et al, 2013) and schizophrenia (Fabbri and Serretti, 2017). Genome-Wide Association Studies identify segments that usually contain many sequence variants, of which only one or a few may be directly involved in the development of a disorder. This brings to the fore technologies that measure the impact of individual sequence variants directly. The limited experimental data can be generalized to more cell types or functional conditions by computational approaches (Shi et al, 2018), e.g., machine learning methods that predict, in silico, the functional impact of individual variants located in different regulatory elements of the human genome.

Journal: Frontiers in Genetics
Publication Date: Oct 31, 2019
Citations: 3
License type: CC BY 4.0