Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

Adam Hughes, Qunfeng Dong, Yang Ruan, Geoffrey Fox, Seung-Hee Bae, Mina Rho, Judy Qiu, Saliya Ekanayake

doi:10.1186/1471-2105-13-s2-s9

Open Access

https://doi.org/10.1186/1471-2105-13-s2-s9

Abstract

Background: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets.

Methods: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters.

Results: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS.

Conclusions: Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
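The abstract names Needleman-Wunsch (NW) pairwise alignment as the source of the genetic distances. As a rough illustration only, the following Python sketch aligns two sequences globally and converts the alignment into a distance. The scoring values (match +1, mismatch -1, linear gap -1), the conversion to a distance as one minus the fraction of identical aligned columns, and the function names `needleman_wunsch`, `nw_distance` and `distance_matrix` are all assumptions made for this example, not parameters or code reported in the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not the paper's settings.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def nw_distance(a, b):
    """Distance = 1 - (identical columns / alignment length). Assumed definition."""
    _, x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0  # two empty sequences are treated as identical
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 1.0 - matches / len(x)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances; every pair is independent."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = nw_distance(seqs[i], seqs[j])
    return dist
```

For instance, `nw_distance("ACGT", "ACGA")` aligns the two reads without gaps and finds three identical columns out of four, giving 0.25. Because each entry of the matrix depends on only one pair of sequences, the double loop in `distance_matrix` can be split across workers without communication, which is what makes the pairwise approach pleasingly parallel.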

Highlights

Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification
Interpolation with 50,000 in-sample and 50,000 out-of-sample sequences: Figure 5 shows the results of running interpolative Multidimensional Scaling (MDS) and NW on the same 100,000 sequences, with 50,000 in-sample and 50,000 out-of-sample data points
This study demonstrates the effectiveness of combining the Needleman-Wunsch genetic distance algorithm with Multidimensional Scaling (MDS) to enable visual identification of sequence clusters in a large sample of raw reads from the 16S rRNA gene
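The split into in-sample and out-of-sample points can be made concrete with a small sketch. It assumes the in-sample points already have coordinates from a full MDS run, and it places one out-of-sample point by iteratively minimizing the squared mismatch (stress) between its target distances and its embedded distances to a handful of in-sample neighbors. The majorization-style update used here is one common way to do this and is an assumption of the example; the paper's own interpolation implementation may differ. The names `interpolate_point` and `stress` are hypothetical.

```python
import math


def interpolate_point(anchors, targets, max_iter=200, tol=1e-9):
    """Place one out-of-sample point among already-embedded in-sample anchors.

    anchors: coordinates of k in-sample points (assumed to come from full MDS).
    targets: desired distances from the new point to each anchor
             (for example, NW-based genetic distances).
    Uses a majorization-style fixed-point update; an illustrative assumption.
    """
    k = len(anchors)
    dim = len(anchors[0])
    centroid = [sum(p[d] for p in anchors) / k for d in range(dim)]
    x = centroid[:]  # start from the centroid of the anchors
    for _ in range(max_iter):
        shift = [0.0] * dim
        for p, delta in zip(anchors, targets):
            dist = math.dist(x, p)
            if dist == 0.0:
                continue  # direction undefined; skip this anchor for this step
            ratio = delta / dist
            for d in range(dim):
                shift[d] += ratio * (x[d] - p[d])
        new_x = [centroid[d] + shift[d] / k for d in range(dim)]
        moved = math.dist(new_x, x)
        x = new_x
        if moved < tol:
            break
    return x


def stress(x, anchors, targets):
    """Sum of squared differences between embedded and target distances."""
    return sum((math.dist(x, p) - t) ** 2 for p, t in zip(anchors, targets))
```

In this sketch each out-of-sample point is placed independently and only needs its distances to the chosen in-sample anchors, not to every other sequence. That is the source of the cost savings relative to running full MDS on all 100,000 points, and it is also why the choice of in-sample (training) set size matters.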

Summary

Introduction

Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across these large data sets (100,000+ sequences) is of particular interest for the purposes of sequence classification and identification of potential gene clusters and families, but such analysis cannot be completed manually and represents a daunting computational task. The aim of this work is the development of an efficient and effective pipeline for clustering large quantities of raw biosequence reads.

Objectives

Methods

Results

Conclusion