Identification of Regulatory SNPs Associated with Vicine and Convicine Content of Vicia faba Based on Genotyping by Sequencing Data Using Deep Learning.

Felix Heinrich, Armin Otto Schmitt, Pronaya Prosun Das, Wolfgang Link, Miriam Kamp, Mehmet Gültas, Martin Wutke

doi:10.3390/genes11060614

Abstract

Faba bean (Vicia faba) is a grain legume that is grown globally both for human consumption and as livestock feed. Despite its agro-ecological importance, the usage of Vicia faba is severely hampered by its anti-nutritive seed compounds vicine and convicine (V+C). The genes responsible for a low V+C content have not yet been identified. In this study, we aim to computationally identify regulatory SNPs (rSNPs), i.e., SNPs in promoter regions of genes that are deemed to govern the V+C content of Vicia faba. For this purpose, we first trained a deep learning model with the gene annotations of seven related species of the Leguminosae family. Applying our model, we predicted putative promoters in a partial genome of Vicia faba that we assembled from genotyping-by-sequencing (GBS) data. Exploiting the synteny between Medicago truncatula and Vicia faba, we identified two rSNPs that are statistically significantly associated with V+C content. In particular, allele substitutions at these rSNPs result in dramatic changes of the binding sites of the transcription factors (TFs) MYB4, MYB61, and SQUA. Knowledge about these TFs and their rSNPs may enhance our understanding of the regulatory programs controlling the V+C content of Vicia faba and could provide new hypotheses for future breeding programs.

Highlights

New methods in the field of genome sequencing, commonly summarized as next generation sequencing (NGS), offer cost-effective strategies to produce massive amounts of sequencing data. One of these methods is genotyping-by-sequencing (GBS), which is an efficient method to obtain genome-wide genotype data for any species [1].
To assess the prediction performance, we identified the number of correctly predicted promoter and non-promoter sequences as True Positives (TP) and True Negatives (TN), respectively, as well as the number of true promoter sequences predicted as non-promoter sequences, False Negatives (FN), and the number of true non-promoter sequences predicted as promoter sequences, False Positives (FP).
Classical applications of GBS include the identification and genotyping of large numbers of genomic variants. This opens several possibilities in plant breeding, such as the discovery of important markers by GWAS even in the absence of a reference genome. We focused on another important property of the GBS approach, namely its capacity to access regulatory regions, which serves as a basis for the identification of regulatory SNPs (rSNPs) in Vicia faba.
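The four counts named in the highlights (TP, TN, FN, FP) can be illustrated with a minimal sketch. The label encoding (1 = promoter, 0 = non-promoter), the function name `confusion_counts`, and the toy labels are assumptions made for illustration; they are not taken from the study's code or data.

```python
from typing import NamedTuple, Sequence


class ConfusionCounts(NamedTuple):
    tp: int  # true promoters predicted as promoters
    tn: int  # true non-promoters predicted as non-promoters
    fp: int  # true non-promoters predicted as promoters
    fn: int  # true promoters predicted as non-promoters


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Count TP, TN, FP, FN for binary labels.

    Assumed encoding (hypothetical): 1 = promoter, 0 = non-promoter.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            raise ValueError(f"labels must be 0 or 1, got ({truth}, {pred})")
    return ConfusionCounts(tp, tn, fp, fn)


# Toy labels invented for illustration only (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

The four counts always sum to the number of evaluated sequences, which is a quick sanity check before deriving any summary metric from them.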

Summary

Introduction

New methods in the field of genome sequencing, commonly summarized as next generation sequencing (NGS), offer cost-effective strategies to produce massive amounts of sequencing data. One of these methods is genotyping-by-sequencing (GBS), which is an efficient method to obtain genome-wide genotype data for any species [1]. Thanks to its easy applicability, GBS is currently the method of choice in the field of plant sciences, since it makes plants without a reference genome amenable to genomic analysis. Several groups have applied GBS to obtain high-quality genome-wide SNP markers. These markers have often been used for applications

Objectives

Methods

Results

Conclusion