Abstract

Contigs assembled from 454 reads of bacterial genomes show a range of read depths, with a number of contigs having a depth far higher than expected. For reference genome sequence datasets, the contig-specific read depth correlates strongly with the number of copies present in the genome. We developed a sequence of applied statistical analyses, which suggests that copy number can be reliably estimated from the read depth distribution in de novo genome assemblies. Read depths of contigs in de novo cyanobacterial genome assemblies were determined, and several high read depth contigs were identified. These contigs were shown to contain mainly genes that are known to be present in multiple copies in bacterial genomes. For these assemblies, a correlation between read depth and copy number was demonstrated experimentally using real-time PCR. Copy number estimates obtained with the statistical analysis developed in this work are presented. Per-contig read depth analysis of assemblies based on 454 reads therefore enables de novo detection of genomic repeats and estimation of their copy number. Additionally, our analysis efficiently identified contigs stemming from sample contamination, allowing their removal from the assembly.

Highlights

  • During assembly of shotgun datasets using the 454 (Newbler) assembly program, reads stemming from regions in the genome that are repeated, that is, present in multiple copies with a high degree of similarity, “collapse” into a single contig [1, 2].

  • Studying contigs of a shotgun assembly of a cyanobacterial genome (Planktothrix rubescens NIVA CYA 98), we discovered that certain genes were likely present in several copies in the genome [4].

  • The results show an excellent match between the predicted copy number and the number of BLAST hits.

Introduction

During assembly of shotgun datasets using the 454 (Newbler) assembly program, reads stemming from regions in the genome that are repeated, that is, present in multiple copies with a high degree of similarity, “collapse” into a single contig [1, 2]. Each collapsed contig represents the repeated part of the genome only once, yet it includes all reads (or parts of reads) derived from every copy of the repeated region. Because 454 reads are evenly distributed over the genome [1, 5, 6, 7, 8], the per-contig read depth should be proportional to the number of copies present in the genome. To investigate this, we analyzed whole genome shotgun assemblies of sequence read datasets obtained with 454 technology, focusing on bacterial genome and BAC datasets for which no pairwise information was present. We find that when genomes are assembled from DNA samples that include some level of contamination, read depth analysis can be used to effectively identify, quantify, and remove the contaminant.
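The proportionality between read depth and copy number can be illustrated with a minimal Python sketch. It is not the statistical analysis developed in this work; it is a simple ratio estimate that assumes most contigs are single copy, so that the median read depth can serve as the single-copy baseline. The function name `estimate_copy_numbers`, the `low_ratio` threshold, and the example depths are all hypothetical and chosen only for illustration.

```python
from statistics import median


def estimate_copy_numbers(depths, low_ratio=0.5):
    """Estimate per-contig copy number from read depth.

    depths: mapping of contig name -> average read depth.
    low_ratio: hypothetical cutoff; contigs whose depth falls below
        this fraction of the baseline are reported separately as
        candidates for contamination rather than given a copy number.

    Assumption (not from the source): most contigs are single copy,
    so the median depth approximates the single-copy read depth.
    """
    baseline = median(depths.values())
    copies = {}
    low_depth = []
    for name, depth in depths.items():
        ratio = depth / baseline
        if ratio < low_ratio:
            # Far below the single-copy depth: possibly a contaminant.
            low_depth.append(name)
        else:
            # Depth is assumed proportional to the number of copies.
            copies[name] = max(1, round(ratio))
    return copies, low_depth


# Made-up depths for six contigs, for illustration only.
example_depths = {
    "contig1": 20.0,
    "contig2": 21.0,
    "contig3": 19.0,
    "contig4": 60.5,
    "contig5": 101.0,
    "contig6": 3.0,
}

copies, low_depth = estimate_copy_numbers(example_depths)
print(copies)
print(low_depth)
```

With these made-up values the median depth is 20.5, so a contig at depth 60.5 is estimated at roughly three copies, while the contig at depth 3.0 is set aside as a low-depth candidate for closer inspection. A fixed cutoff is a simplification; a real analysis would model the depth distribution instead.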
