Enabling scalable and accurate clustering of distributed ligand geometries on supercomputers

Boyu Zhang, Pietro Cicotti, Pavan Balaji, Trilce Estrada, Michela Taufer

doi:10.1016/j.parco.2017.02.005

Abstract

Scalable method to cluster molecules from docking simulations on distributed systems. Projections and interpolations into 3-D and 6-D capture molecular geometries. Our approach scales up to 2048 processing cores and 2 TB input data. Our approach is more accurate than energy-based and centralized clustering methods.

We present an efficient and accurate clustering method for the analysis of protein-ligand docking datasets on large distributed-memory systems. For each ligand conformation in the dataset, our clustering algorithm first extracts relevant geometrical properties and transforms them into a single metadata point in N-dimensional (N-D) space. It then performs N-D clustering on the metadata to search for predominant clusters. Because relevant data properties are extracted locally and concurrently, our method avoids the need to move ligand conformations among nodes. By doing so, we transform the analysis problem (e.g., clustering or classification) into a search for property aggregates. Our analysis shows that on small computer systems of up to 64 nodes, performance is not sensitive to data content and distribution. On larger computer systems of up to 256 nodes, the scalability of simulations with strong convergence toward specific geometries is less sensitive to overheads due to the shuffling of metadata information. We also demonstrate that our method of metadata extraction captures the geometrical properties of ligand conformations more effectively, and clusters and predicts near-native ligand conformations more accurately, than traditional methods, including hierarchical clustering and energy-based scoring.
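
The two steps described in the abstract (reduce each conformation to one metadata point, then look for dense regions of points) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one plausible reading of "projections and interpolations into 3-D": each conformation is projected onto the xy, yz, and zx planes, and the least-squares slope in each plane becomes one coordinate of the metadata point. The uniform grid used to find the densest region is a stand-in for the paper's N-D clustering, and the names `slope`, `metadata_point`, `densest_cell`, and the `cell` parameter are hypothetical.

```python
from collections import Counter


def slope(us, vs):
    """Least-squares slope of v against u; returns 0.0 if u has no spread."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    if var_u == 0.0:
        return 0.0
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    return cov / var_u


def metadata_point(atoms):
    """Map one conformation, given as a list of (x, y, z) atoms, to a 3-D point.

    Assumption: the three coordinates are the regression slopes of the
    conformation's projections onto the xy, yz, and zx planes.
    """
    xs, ys, zs = zip(*atoms)
    return (slope(xs, ys), slope(ys, zs), slope(zs, xs))


def densest_cell(points, cell=0.5):
    """Bin metadata points on a uniform grid; return (cell index, count)
    for the fullest cell. A simplified stand-in for N-D clustering."""
    counts = Counter(tuple(int(c // cell) for c in p) for p in points)
    return counts.most_common(1)[0]
```

In a distributed setting, each node would call `metadata_point` on its own conformations, so only the small metadata tuples (or per-cell counts) would need to be exchanged rather than the conformations themselves, which is the property the abstract emphasizes.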

Full Text