Abstract

Interpreting the human genome sequence is one of the major scientific endeavors of our time. In February 2001, when the human genome reference sequence was initially released (Lander et al. 2001), our understanding of the encoded contents was surprisingly limited. It was perplexing to many in the scientific community when we realized that the human genome contains only ∼21,000 distinct protein-coding genes (Claverie 2001; Hollon 2001; Pennisi 2003; Clamp et al. 2007), as other less complex species like the nematode Caenorhabditis elegans were known to have a similar number of protein-coding genes (Hillier et al. 2005). It quickly became apparent that the developmental and physiological complexity of humans would not be explained solely by the number of protein-coding genes, and the quest to understand the contents of the human genome began full force. The Encyclopedia of DNA Elements (ENCODE) Project was launched in September of 2003 with the daunting task of identifying all the functional elements encoded in the human genome sequence. To accomplish this task, the National Human Genome Research Institute (NHGRI) organized The ENCODE Project Consortium, which consists of an international group of scientists with diverse expertise in experimental and computational methods for generating and analyzing high-throughput genomic data (The ENCODE Project Consortium 2004). During the initial four years, the consortium conducted a pilot project which focused on annotating functional elements in a defined 1% of the human genome consisting of ∼30 Mb divided among 44 genomic regions. On June 14, 2007, a report summarizing the findings of the pilot project revealed pervasive transcription of the human genome, with the majority of nucleotides represented in transcripts in at least a limited number of cell types at some time (The ENCODE Project Consortium 2007). Many of these transcripts comprised novel noncoding RNA genes. 
Importantly, The ENCODE Pilot Project assigned function to 60% of the evolutionarily constrained bases in the 44 genomic regions and identified many additional functional elements seemingly unconstrained across mammalian evolution. Integration of the various experimental data generated by The ENCODE Pilot Project provided further insights into connections between chromatin structure (modifications and accessibility) and gene expression (The ENCODE Project Consortium 2007; Koch et al. 2007; Thurman et al. 2007; Zhang et al. 2007) and the timing of replication (Karnani et al. 2007). Armed with increased knowledge about the types of functional elements contained within the human genome sequence and with the advent of massively parallel sequencing, in 2007 the ENCODE project was expanded to study the entire human genome. This month, Nature published a paper entitled “An integrated encyclopedia of DNA elements in the human genome,” which reports the production and initial analysis of 1640 data sets focused on two major classes of annotations: genes (both coding and noncoding) along with their corresponding RNA transcripts, and transcriptional regulatory regions. This paper (The ENCODE Project Consortium 2012), along with companion papers in Nature, Genome Research, and Genome Biology, provides much more than a mere inventory of sequence elements: it presents an integrated analysis that offers important insights into the functional organization of the human genome. In keeping with the tradition of large consortia sponsored by the NHGRI, the ENCODE project has made all data and derived results available through a freely accessible database (Rosenbloom et al. 2010). The following sections describe some of the highlights of the ENCODE project, including technical accomplishments, high-quality data sets, and integrated analyses with other resources, such as disease-associated variants identified through genome-wide association studies.
