Abstract

Background

Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious. In species like maize, with diverse intergenic regions and many repetitive elements, the problem is especially challenging and limits how well data from one line transfer to another. Although regulatory regions are rare, they have characteristic chromatin contexts and a characteristic sequence organization (the grammar) by which they can be identified.

Results

We developed a computational framework that exploits this sequence arrangement. The models learn to classify regulatory regions based on sequence features (k-mers). To do this, we borrowed two approaches from the field of natural language processing: (1) “bag-of-words”, which is commonly used to weight key words differentially in tasks like sentiment analysis, and (2) a vector-space model using word2vec (vector-k-mers), which captures semantic and linguistic relationships between words. We built “bag-of-k-mers” and “vector-k-mers” models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. The “bag-of-k-mers” models achieved higher overall accuracy, while the “vector-k-mers” models were more useful for highlighting key groups of sequences within the regulatory regions.

Conclusions

These models provide powerful tools to annotate regulatory regions in maize lines beyond the reference, at low cost and with high accuracy.
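To make the “bag-of-k-mers” idea concrete, the following is a minimal sketch rather than the authors' implementation. It splits each sequence into overlapping k-mers and weights each k-mer by its frequency within the sequence and its rarity across sequences. The function names, the value of k, and the smoothed TF-IDF weighting formula are all assumptions chosen for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of a DNA sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_kmers(seqs, k):
    """Map each sequence to a dict of TF-IDF weighted k-mers.

    Assumption: a smoothed inverse document frequency,
    idf = ln((1 + n) / (1 + df)) + 1, is used for illustration only.
    """
    bags = [Counter(kmers(s, k)) for s in seqs]
    n = len(bags)
    # Document frequency: number of sequences in which each k-mer occurs.
    df = Counter(km for bag in bags for km in bag)
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            km: (count / total) * (math.log((1 + n) / (1 + df[km])) + 1)
            for km, count in bag.items()
        })
    return weighted


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the paper.
    for weights in bag_of_kmers(["ACGTACGT", "TTGACGTT"], k=3):
        print(weights)
```

The resulting weight dictionaries could serve as feature vectors for any standard classifier. A k-mer shared by every sequence receives a lower weight than one found in only a few, which is the sense in which key "words" are weighted differentially.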

Highlights

  • Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious

  • Weighted frequencies and co-occurrences of short sequences can accurately discriminate regulatory from random genomic regions

  • To build accurate classifiers, we collected a comprehensive set of regions enriched in regulatory function, as identified in B73 through different biochemical assays

  • We included open chromatin regions identified by MNase-seq in two tissues [3], binding loci from ChIP-seq peaks of two transcription factors (TFs), namely the homeobox TF KNOTTED 1 (KN1) and the bZIP TF FASCIATED EAR 4 (FEA4) [34, 35], and core promoter regions around transcription start sites (TSSs) [36, 37, 38] (Additional file 1: Table S1)

Summary

Introduction

Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious. Biochemical characterization of the open chromatin space in B73 (the maize reference line) revealed that as much as 40% of the significant sequence polymorphisms, as identified through variance components analyses, overlap with regions in which regulatory elements are expected [3]. These biochemical assays are prohibitively expensive and time-consuming at the scale of breeding programs for any crop species. This is even more true for species, such as maize, with high genomic diversity and a high rate of polymorphism.
