Realistic artificial DNA sequences as negative controls for computational genomics.

Juan Caballero, Arian F A Smit, Leroy Hood, Gustavo Glusman

doi:10.1093/nar/gku356

Abstract

A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by ‘shuffling’ real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/.
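The contrast between shuffled and realistic controls can be illustrated with a small sketch. A simple mononucleotide shuffle preserves base composition but tends to scramble higher-order structure such as dinucleotide biases and repeated motifs, which is one plausible reason a shuffled control may look "easier" than real intergenic sequence. The function names and the toy sequence below are hypothetical and chosen only for illustration; this is not the authors' method or code.

```python
import random
from collections import Counter
from typing import Optional


def shuffle_sequence(seq: str, seed: Optional[int] = None) -> str:
    """Return a mononucleotide shuffle of seq: same bases, random order."""
    rng = random.Random(seed)  # local generator so the shuffle is reproducible
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def dinucleotide_counts(seq: str) -> Counter:
    """Count overlapping dinucleotides, a simple proxy for local structure."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


# Hypothetical toy sequence with strong local structure (CG and AT runs).
real = "CGCGCGCGATATATATATCGCGCGCGATATATAT"
shuffled = shuffle_sequence(real, seed=1)

# Base composition is identical by construction.
composition_preserved = Counter(real) == Counter(shuffled)

# Dinucleotide profiles can be compared to see what the shuffle destroyed.
real_dinucs = dinucleotide_counts(real)
shuffled_dinucs = dinucleotide_counts(shuffled)
```

Comparing `real_dinucs` with `shuffled_dinucs` shows how much local structure a shuffle removes while leaving the overall composition untouched.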

Highlights

Genomes evolve by random accumulation of mutations and by selection for a variety of functional requirements
We model the genome as being composed of three classes of sequence: (i) sequences under functional or mutational constraints, (ii) sequences that arose by duplication but are largely unconstrained, and (iii) a background or ‘base’ sequence
Based on the available annotation of the human genome, we identified 574 Mb of ‘base’ sequence (17% of the genome) left after removing all fragments annotated as genes, pseudogenes, CpG islands, ultraconserved sequences and repetitive sequences
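The idea of obtaining "base" sequence by removing annotated fragments can be sketched as interval subtraction: merge all annotated intervals, then keep whatever is left uncovered. The sketch below assumes half-open `(start, end)` coordinates on a single sequence; the function names and the toy coordinates are hypothetical and do not reproduce the authors' pipeline or annotation sources.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open: start inclusive, end exclusive


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def base_regions(length: int, annotated: List[Interval]) -> List[Interval]:
    """Return the regions of [0, length) not covered by any annotation."""
    regions: List[Interval] = []
    cursor = 0
    for start, end in merge_intervals(annotated):
        start, end = max(start, 0), min(end, length)  # clamp to the sequence
        if start > cursor:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        regions.append((cursor, length))
    return regions


# Toy example: a 100 bp sequence with three annotated fragments,
# e.g. a gene, an overlapping repeat, and a CpG island (all hypothetical).
annotated = [(10, 30), (20, 40), (70, 80)]
regions = base_regions(100, annotated)
base_bp = sum(end - start for start, end in regions)
base_fraction = base_bp / 100
```

In this toy case the uncovered regions are `(0, 10)`, `(40, 70)` and `(80, 100)`, so 60 of 100 bases remain as "base" sequence.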

Summary

Introduction

Genomes evolve by random accumulation of mutations and by selection for a variety of functional requirements. For species with short generation times and large population sizes (e.g. bacteria), strong selective forces lead to highly optimized genomes that are dense in genes and carry negligible overhead of non-functional sequence; this makes prokaryotic gene prediction relatively straightforward [1]. The genomes of species with much longer generation times and much smaller population sizes (e.g. vertebrates) accumulate vast amounts of genetic material that largely appears not to be under selective constraints [2]. Functional sequences and regulatory elements make up a small fraction of the vertebrate genome, which makes them difficult to identify. Recognizing alternative splicing demands further algorithmic complexity, as does modeling of non-coding transcripts. For all these reasons and more, ab initio vertebrate gene prediction poses a significant challenge for computational biology.

Journal: Nucleic Acids Research
Publication Date: May 6, 2014
Citations: 28
License type: CC BY 3.0