Parallel Ward Clustering for Chemical Compounds Using MapReduce

Mohamed G Malhat, Ashraf B El-Sisi, Hamdy M Mousa

doi:10.1007/978-3-319-13461-1_25

Abstract

The availability of chemical libraries with millions of compounds makes identifying similar chemical compounds more challenging. Compounds with similar structures are likely to exhibit similar biological activity, so identifying such compounds is a key step in the drug discovery process. Hierarchical clustering methods have been developed for this purpose, and the Ward clustering algorithm is one of the most popular of them in drug discovery applications. A fundamental problem with previous implementations of this method is that they cannot handle large data sets within reasonable time and memory limits. In this paper, the MapReduce framework is used to run the Ward clustering algorithm in parallel. The results show a considerable reduction in computation time: the parallel Ward algorithm saves 17% of the time using 3 map instances and 58% of the time using 6 map instances.
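The abstract does not describe how the work is split across map instances, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes compounds are represented as numeric descriptor vectors, uses the standard Ward merge cost (the increase in within-cluster sum of squares), and imitates a map phase by dividing the candidate cluster pairs into `n_maps` chunks. The names `ward_cost`, `map_partition`, `ward_step`, and `n_maps` are hypothetical, and everything runs in a single process rather than on a real MapReduce cluster.

```python
from functools import reduce
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge.

    Each cluster is a (size, centroid) pair.
    """
    (na, ca), (nb, cb) = a, b
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * squared_distance


def map_partition(clusters, pairs):
    """Map step (assumed design): cheapest merge within one chunk of pairs."""
    return min((ward_cost(clusters[i], clusters[j]), i, j) for i, j in pairs)


def merge(a, b):
    """Combine two clusters into one with a size-weighted centroid."""
    (na, ca), (nb, cb) = a, b
    n = na + nb
    return n, tuple((na * x + nb * y) / n for x, y in zip(ca, cb))


def ward_step(clusters, n_maps=3):
    """One agglomeration step: map over pair chunks, reduce to the global best."""
    pairs = list(combinations(range(len(clusters)), 2))
    chunks = [pairs[k::n_maps] for k in range(n_maps)]
    partial = [map_partition(clusters, chunk) for chunk in chunks if chunk]
    _, i, j = reduce(min, partial)  # reduce step: keep the cheapest merge
    merged = merge(clusters[i], clusters[j])
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]


def ward(points, n_clusters, n_maps=3):
    """Merge singleton clusters until n_clusters remain."""
    clusters = [(1, tuple(p)) for p in points]
    while len(clusters) > n_clusters:
        clusters = ward_step(clusters, n_maps)
    return clusters
```

Calling `ward([(0, 0), (0, 1), (10, 10), (10, 11)], n_clusters=2)` groups the two points near the origin and the two points near (10, 10). In a real deployment each chunk would be handled by a separate map task, which is where any time saving would come from.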
