uQlust: combining profile hashing with linear-time ranking for efficient clustering and analysis of big macromolecular data.

Rafal Adamczak, Jarek Meller

doi:10.1186/s12859-016-1381-2

Abstract

Background: Advances in computing have enabled current protein and RNA structure prediction and molecular simulation methods to dramatically increase their sampling of conformational spaces. The quickly growing number of experimentally resolved structures, and databases such as the Protein Data Bank, also call for large-scale structural similarity analyses to retrieve and classify macromolecular data. Consequently, the computational cost of structure comparison and clustering for large sets of macromolecular structures has become a bottleneck that necessitates further algorithmic improvements and development of efficient software solutions.

Results: uQlust is a versatile and easy-to-use tool for ultrafast ranking and clustering of macromolecular structures. uQlust makes use of structural profiles of proteins and nucleic acids, while combining a linear-time algorithm for implicit comparison of all pairs of models with profile hashing to enable efficient clustering of large data sets with a low memory footprint. In addition to ranking and clustering of large sets of models of the same protein or RNA molecule, uQlust can also be used in conjunction with fragment-based profiles in order to cluster structures of arbitrary length. For example, hierarchical clustering of the entire PDB using profile hashing can be performed on a typical laptop, thus opening an avenue for structural explorations previously limited to dedicated resources. The uQlust package is freely available under the GNU General Public License at https://github.com/uQlust.

Conclusion: uQlust represents a drastic reduction in the computational complexity and memory requirements with respect to existing clustering and model quality assessment methods for macromolecular structure analysis, while yielding results on par with traditional approaches for both proteins and RNAs.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1381-2) contains supplementary material, which is available to authorized users.

Highlights

Advances in computing have enabled current protein and RNA structure prediction and molecular simulation methods to dramatically increase their sampling of conformational spaces.
The main idea is to use structural profiles in order to define hashing keys that map similar structures into the same values of a hash function, and enable collating profiles/structures with the same keys into initial micro-clusters. These micro-clusters are subsequently either tuned to obtain a certain number (K) of clusters and data coverage, or aggregated hierarchically using the Hamming, cosine or other applicable distance measure. Building on these algorithmic engines, we present the uQlust package, which combines 1D structural profiles, hashing and linear-time ranking to enable ultrafast clustering of very large sets of atomistic or coarse-grained protein or RNA structures.
Linear-time ranking of macromolecular models: as shown in [9], by projecting macromolecular 3D coordinates into a suitable 1D profile and pre-processing the profiles to compute the state frequency vector at each profile position, one can implicitly compare all pairs of models and compute their overall geometric consensus ranking with a linear time complexity algorithm.
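
The linear-time ranking idea can be illustrated with a minimal Python sketch. It assumes that every model has already been projected into a 1D profile of the same length, written as a string of one-letter states, and that agreement between two models is measured by simple state identity at each position. The function names and this identity-based score are assumptions made for illustration; they are not taken from uQlust, whose actual scoring may differ.

```python
from collections import Counter


def consensus_scores(profiles):
    """Score each profile by its mean per-position agreement with all profiles.

    Uses per-position state frequencies, so the cost grows linearly with the
    number of models (times the profile length). Illustrative sketch only.
    """
    if not profiles:
        return []
    length = len(profiles[0])
    if length == 0 or any(len(p) != length for p in profiles):
        raise ValueError("profiles must be non-empty and of equal length")

    # Pass 1: count how often each state occurs at each profile position.
    counts = [Counter() for _ in range(length)]
    for profile in profiles:
        for pos, state in enumerate(profile):
            counts[pos][state] += 1

    # Pass 2: a model's score is the summed frequency of its own states,
    # which equals its total number of matches against every model.
    n = len(profiles)
    return [
        sum(counts[pos][state] for pos, state in enumerate(profile)) / (n * length)
        for profile in profiles
    ]


def pairwise_scores(profiles):
    """Quadratic-time reference: compare every pair of profiles explicitly."""
    n = len(profiles)
    length = len(profiles[0])
    return [
        sum(sum(a == b for a, b in zip(p, q)) for q in profiles) / (n * length)
        for p in profiles
    ]
```

Under these assumptions both functions give the same scores, but `consensus_scores` never compares two models directly: the pairwise comparison is implicit in the frequency table. Sorting the models by score then gives a consensus ranking.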

Summary

Background

Clustering techniques are widely used in the analysis and interpretation of molecular simulations for biological macromolecules, such as proteins and nucleic acids. These micro-clusters are subsequently either tuned (with some level of profile coarse graining and further projections/filters) to obtain a certain number (K) of clusters and data coverage (the fraction of structures included in these K clusters), or aggregated hierarchically using the Hamming, cosine or other applicable distance measure (see Fig. 1). Building on these algorithmic engines, we present the uQlust package, which combines 1D structural profiles, hashing and linear-time ranking to enable ultrafast clustering of very large sets of atomistic or coarse-grained protein or RNA structures. The resulting 10 distinct states can be further split based on base-pair type assignment, similar to that used for RNA-SS-LW. The profiles defined in this way, as listed in Additional file 1: Table S1, can be used either for model assessment using 1D-Jury (denoted as uQlust:1D-ProfileName), or for explicit clustering with profile hashing, using hash keys generated with a profile of choice to provide an initial ‘slicing’ of data. Work is in progress to enable the use of uQlust (in particular, for profile pre-processing) in conjunction with the Hadoop Map/Reduce framework, using the Microsoft Azure plugin for C#.
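
The micro-cluster construction and the coverage measure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: profiles are equal-length strings of one-letter states, the hash key is the profile itself, and coarse graining is modeled by keeping every `step`-th position. The function names, the `step` parameter and this particular coarse-graining scheme are hypothetical and are not taken from the uQlust code.

```python
from collections import defaultdict


def micro_clusters(profiles, step=1):
    """Group structure names whose (optionally coarse-grained) profiles match.

    `profiles` maps a structure name to its profile string. Keeping every
    `step`-th position is a hypothetical stand-in for profile coarse graining.
    Returns the clusters sorted from largest to smallest.
    """
    buckets = defaultdict(list)
    for name, profile in profiles.items():
        key = profile[::step]  # the dictionary hashes this key for us
        buckets[key].append(name)
    return sorted(buckets.values(), key=len, reverse=True)


def top_k_coverage(clusters, k):
    """Fraction of all structures that fall into the k largest clusters."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    return sum(len(c) for c in clusters[:k]) / total


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

In this sketch, a coarser key (larger `step`) merges more structures into the same bucket, which raises the coverage achieved by the K largest clusters at the price of resolution. The `hamming` helper shows the kind of distance that could drive a subsequent hierarchical aggregation of micro-cluster representatives.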

Results and discussion

Method

Conclusions