Abstract
SREBP1 and SREBP2 are cholesterol sensors that modulate the expression of cholesterol-related genes. SREBP binding sites are characterized by multiple target sequences, such as SRE, NFY and SP1, which can be arranged differently in different genes, so the binding site is not easy to identify by direct DNA sequence analysis. This paper presents a complete workflow based on a one-dimensional Convolutional Neural Network (CNN) model able to detect putative SREBP binding sites irrespective of the arrangement of the target elements. The strategy is based on the recognition of SRE sequences linked (less than 250 bp apart) to NFY sequences, according to chromosomal localization derived from TF Immunoprecipitation (TF ChIP) experiments. The CNN is trained with several 100 bp sequences containing both SRE and NFY. Once trained, the model is used to predict the presence of SRE-NFY in the first 500 bp of all known gene promoters. Finally, genes are grouped according to biological process, and the processes enriched in genes containing SRE-NFY in their promoters are analyzed in detail. This workflow made it possible to identify biological processes, not directly linked to cholesterol metabolism, that are enriched in SRE-containing genes, as well as possible novel DNA patterns able to fill in for missing classical SRE sequences.
Highlights
Deep Learning techniques have been widely applied to the study of nucleic acids and protein sequences in recent years
The promoter was split into 40 sequences of 100 bases with a partial overlap
Convolutional Neural Networks have been used in genetics to detect particular nucleic acid sequences, exploiting the ability of CNNs to discover non-contiguous patterns in sequences, as is the case for enhancers [1,2,3,12] or long non-coding RNA [13]
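The splitting of a promoter into partially overlapping 100-base sequences, mentioned in the highlights, can be sketched as a sliding window. This is only an illustration: the function name `split_into_windows` and the `stride` value are assumptions, since the text above does not state the overlap that was actually used.

```python
def split_into_windows(sequence, window=100, stride=50):
    """Return overlapping fixed-length windows of `sequence`.

    `window` follows the 100-base length given in the text.
    `stride` is a hypothetical value: the overlap used in the paper
    is not stated in this section.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(sequence) - window
    # Sequences shorter than one window yield an empty list.
    return [sequence[start:start + window]
            for start in range(0, last_start + 1, stride)]


# Toy 500-base sequence (not a real promoter).
toy_promoter = "ACGT" * 125
windows = split_into_windows(toy_promoter, window=100, stride=50)
print(len(windows), len(windows[0]))
```

With this scheme a sequence of length L gives (L - window) // stride + 1 windows, so the window count depends on both the promoter length and the chosen stride.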
Summary
Deep Learning techniques have been widely applied to the study of nucleic acid and protein sequences in recent years. Convolutional Neural Networks (CNNs) have been widely used for these purposes [1,2,3,4,5] because they are position-invariant, meaning that a single feature (i.e., a particular stretch of elements) can be detected irrespective of its position within a sequence. This is highly relevant to the analysis of transcription factor (TF) binding sites located in gene promoters. In the case of nucleic acids, a primary sequence is a linear vector expressed in a single dimension, so a simplification of the CNN, called CNN-1D, has been developed.
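The position invariance described above can be illustrated with a minimal pure-Python sketch: a single one-dimensional filter slides along a one-hot encoded sequence, and a global maximum over its activations reports the same score wherever the feature sits. Everything here is an assumption made for illustration; the helper names, the hand-built filter and the toy motif `GATTACA` (which is not an SRE or NFY sequence) are hypothetical, and the paper's trained model is not reproduced.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of 4 values (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES]
            for base in sequence.upper()]


def conv1d(encoded, kernel):
    """Slide one filter along the sequence (valid cross-correlation)."""
    width = len(kernel)
    return [
        sum(encoded[start + i][c] * kernel[i][c]
            for i in range(width) for c in range(len(BASES)))
        for start in range(len(encoded) - width + 1)
    ]


def global_max(activations):
    """Global max pooling: keep the strongest response, drop its position."""
    return max(activations)


# Hypothetical filter: a one-hot template of a toy motif, standing in
# for a filter that a trained CNN-1D would learn from data.
toy_motif = "GATTACA"
kernel = one_hot(toy_motif)

seq_start = toy_motif + "C" * 20   # motif at the beginning
seq_end = "C" * 20 + toy_motif     # same motif at the end

score_start = global_max(conv1d(one_hot(seq_start), kernel))
score_end = global_max(conv1d(one_hot(seq_end), kernel))
print(score_start, score_end)      # both scores are equal
```

Because the same filter weights are reused at every offset, the pooled score depends on whether the feature is present and not on where it occurs, which is the property that makes this kind of model suitable for binding sites at variable positions within a promoter.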