An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads.

Julia D Warnke, Hesham H Ali

doi:10.1186/1471-2105-14-s11-s7

Abstract

Background: Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly.

Results: Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies.

Conclusions: Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies.
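The idea of a series of progressively coarser overlap graphs, each yielding read clusters, can be illustrated with a small sketch. The abstract does not state the coarsening rule, so the code below assumes heavy-edge matching, a common choice in multilevel graph algorithms, and should not be read as the authors' method. The function names `coarsen_once` and `coarsen`, the integer read ids, and the toy overlap weights are all hypothetical.

```python
from collections import defaultdict


def coarsen_once(clusters, edges):
    """Build the next coarser level by merging node pairs along heavy edges.

    clusters: dict mapping node id -> frozenset of original read ids
    edges:    dict mapping (u, v) with u < v -> overlap weight
    Assumption: heavy-edge matching; the paper's actual rule may differ.
    """
    matched = {}
    next_id = 0
    # Visit edges from heaviest to lightest; ties are broken by node ids.
    for (u, v), _weight in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
        if u not in matched and v not in matched:
            matched[u] = matched[v] = next_id
            next_id += 1
    # Nodes left unmatched carry over to the next level unchanged.
    for u in sorted(clusters):
        if u not in matched:
            matched[u] = next_id
            next_id += 1

    # Each coarse node remembers every original read it contains.
    new_clusters = defaultdict(frozenset)
    for u, members in clusters.items():
        new_clusters[matched[u]] = new_clusters[matched[u]] | members

    # Edges inside a merged pair vanish; parallel edges have weights summed.
    new_edges = defaultdict(int)
    for (u, v), weight in edges.items():
        a, b = matched[u], matched[v]
        if a != b:
            new_edges[(min(a, b), max(a, b))] += weight
    return dict(new_clusters), dict(new_edges)


def coarsen(read_ids, edges, levels):
    """Return a list of (clusters, edges) pairs, finest level first."""
    clusters = {r: frozenset([r]) for r in read_ids}
    hierarchy = [(clusters, edges)]
    for _ in range(levels):
        clusters, edges = coarsen_once(clusters, edges)
        hierarchy.append((clusters, edges))
    return hierarchy


# Toy overlap graph: six reads in a chain, weights are overlap lengths.
toy_edges = {(0, 1): 50, (1, 2): 40, (2, 3): 10, (3, 4): 45, (4, 5): 30}
for level, (level_clusters, level_edges) in enumerate(coarsen(range(6), toy_edges, 2)):
    groups = sorted(sorted(members) for members in level_clusters.values())
    print(level, groups, level_edges)
```

Each level keeps both the reduced graph and the read membership of every coarse node, so clusters can be read off at any granularity without discarding the overlap relationships between them.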

Highlights

Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates.
While assembly results have been shown to be substantially improved by clustering metagenomics data before sequence assembly [13], overlap relationships retained by the assembly overlap graph are lost, leading to the removal of key global read overlap relationships and read similarities.
We evaluate our algorithm’s graph coarsening and clustering results and compare them to results obtained by clustering a similar 454 metagenomics read dataset.

Summary

Introduction

Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. Metagenomics is a field of research that focuses on the sequencing of communities of organisms. This adds a layer of complexity to the analysis of short reads produced from metagenomics samples, which contain multiple sources of genetic information. Often these reads must be clustered or binned into their respective genomes before assembly or analysis can take place, to avoid chimeric assembly results [9]. While assembly results have been shown to be substantially improved by clustering metagenomics data before sequence assembly [13], overlap relationships retained by the assembly overlap graph are lost, leading to the removal of key global read overlap relationships and read similarities.



Journal: BMC Bioinformatics	Publication Date: Sep 1, 2013
Citations: 16	License type: cc-by


Similar Papers

An efficient overlap graph coarsening approach for modeling short reads
Julia Warnke ... Hesham H Ali
01 Oct 2012

QColors: an algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads.
Austin Huang ... Sorin Istrail
In silico biology | VOL. 11
01 Feb 2011

QColors: An algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads
Austin Huang ... Leeann Schreier
01 Nov 2011

Graph mining for next generation sequencing: leveraging the assembly graph for biological insights.
Julia Warnke-Sommer ... Hesham Ali
BMC genomics | VOL. 17
06 May 2016
