Abstract

Background: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the human genome project. As the reference genome is used in probe design for microarray technology and in mapping short reads in next-generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has investigated the distribution, evolution and functionality of those sequences in human populations.

Results: To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population, but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity.

Conclusions: 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-685) contains supplementary material, which is available to authorized users.

Highlights

  • Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome

  • Identification of missing common sequences (micSeqs) by pooling samples from the 1000 Genomes Project. Our pipeline to detect micSeqs contains four steps: 1) data collection and quality control; 2) identification of One End Anchor (OEA) reads (Figure 2) and de novo assembly of micSeqs using pooled OEA reads from 363 individuals; 3) identification and filtering of false micSeqs based on the genomic location of anchor reads corresponding to the orphan reads that are assembled together (Figure 2); 4) validation of filtered micSeqs via comparative genomic analysis on a database of vertebrate genomes, comparison to the deeply sequenced genomes of four individuals, and comparison to RNA-Seq data, as well as direct experimental examination of micSeqs

  • Pooling permits the use of shallow sequence data, and the identification of common sequences depends on a stringent coverage cutoff for de novo assembly, in this case at least 50× coverage
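
The OEA identification in step 2 can be illustrated with a minimal sketch. It assumes the paired-end reads have already been aligned and are available as SAM-format text lines; the function names `classify_oea` and `collect_oea` are hypothetical and are not taken from the authors' pipeline. Only the standard SAM flag bits are used: a read whose mate failed to map is treated as the anchor, and the unmapped mate as the orphan.

```python
# SAM flag bits as defined in the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_oea(flag: int) -> str:
    """Label one read of a pair as 'anchor', 'orphan', or 'other'."""
    if not flag & PAIRED:
        return "other"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and not mate_unmapped:
        return "orphan"  # this read did not map, but its mate did
    if mate_unmapped and not read_unmapped:
        return "anchor"  # this read mapped, but its mate did not
    return "other"       # both mapped or both unmapped


def collect_oea(sam_lines):
    """Group OEA pairs by read name (hypothetical helper).

    Returns {name: {'anchor': (chrom, pos), 'orphan': sequence}}.
    """
    pairs = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        kind = classify_oea(flag)
        if kind == "anchor":
            # Columns 3 and 4 hold the reference name and 1-based position.
            pairs.setdefault(name, {})["anchor"] = (fields[2], int(fields[3]))
        elif kind == "orphan":
            # Column 10 holds the read sequence used later for assembly.
            pairs.setdefault(name, {})["orphan"] = fields[9]
    return pairs
```

In a sketch like this, the orphan sequences would be pooled across individuals for de novo assembly, while the anchor positions would be kept for the location-based filtering of step 3. The 50× coverage cutoff and the filtering rules themselves are not modeled here.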

Introduction

Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the human genome project. As the reference genome is used in probe design for microarray technology and in mapping short reads in next-generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. The completion of the human reference genome and the development of high-throughput technologies, such as microarrays and NGS, have revolutionized methods for the efficient assessment of genome composition in human populations. Genome surveys revealed an unexpectedly large extent of other categories of sequence variants, including copy-number variants (CNV), inversions, translocations, and small to large sequence insertions and deletions [7].
