Abstract

Background: Complex network theory based methods and the emergence of “Big Data” have reshaped the terrain of investigating structure-activity relationships of molecules. This change gave rise to new methods which need to face an important challenge, namely: how to restructure a large molecular dataset into a network that best serves the purpose of the subsequent analyses. With special focus on network clustering, our study addresses this open question by proposing a data transformation method and a clustering framework.

Results: Using the WOMBAT and PubChem MLSMR datasets we investigated the relation between varying the similarity threshold applied on the similarity matrix and the average clustering coefficient of the emerging similarity-based networks. These similarity networks were then clustered with the InfoMap algorithm. We devised a systematic method to generate so-called “pseudo-reference” clustering datasets which compensate for the lack of large-scale reference datasets. With help from the clustering framework we were able to observe the effects of varying the similarity threshold and its consequence on the average clustering coefficient and the clustering performance.

Conclusions: We observed that the average clustering coefficient versus similarity threshold function can be characterized by the presence of a peak that covers a range of similarity threshold values. This peak is preceded by a steep decline in the number of edges of the similarity network. The maximum of this peak is well aligned with the best clustering outcome. Thus, if no reference set is available, choosing the similarity threshold associated with this peak would be a near-ideal setting for the subsequent network cluster analysis. The proposed method can be used as a general approach to determine the appropriate similarity threshold to generate the similarity network of large-scale molecular datasets.

Electronic supplementary material: The online version of this article (doi:10.1186/s13321-016-0127-5) contains supplementary material, which is available to authorized users.

Highlights

  • Average clustering coefficient (ACC) as a function of similarity threshold: we studied ACC in three datasets, namely Small Combinatorial Libraries (SCL), World of Molecular Bioactivity (WOMBAT), and Molecular Libraries Small Molecule Repository (MLSMR).
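
For reference, the ACC of a thresholded similarity network can be written out as follows. This is the standard local clustering coefficient averaged over nodes; the symbols $s_{ij}$, $t$, $A_{ij}$, $k_i$ and $e_i$ are notation introduced here, and both the use of $\ge$ (rather than $>$) and the convention for nodes of degree below 2 are assumptions, since this page does not state the paper's exact definition.

```latex
% Adjacency of the similarity network at threshold t
A_{ij}(t) =
\begin{cases}
1 & \text{if } s_{ij} \ge t \text{ and } i \ne j,\\
0 & \text{otherwise,}
\end{cases}
\qquad
k_i = \sum_{j} A_{ij}(t)

% Local clustering coefficient of node i, where e_i is the number of
% edges among the k_i neighbours of i (taken as 0 when k_i < 2)
C_i(t) = \frac{2\, e_i}{k_i \,(k_i - 1)}

% Average clustering coefficient over all N nodes
\mathrm{ACC}(t) = \frac{1}{N} \sum_{i=1}^{N} C_i(t)
```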

Summary

Introduction

Complex network theory based methods and the emergence of “Big Data” have reshaped the terrain of investigating structure-activity relationships of molecules. This change gave rise to new methods which need to face an important challenge, namely: how to restructure a large molecular dataset into a network that best serves the purpose of the subsequent analyses. Complex network theory based clustering algorithms represent a relatively new class of methods applied to the field of cheminformatics. This class of methods can process large data sets in reasonable time. The outcome of any network based clustering is substantially influenced by the underlying network topology.
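
The threshold-selection idea from the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It turns a similarity matrix into a network at several thresholds, computes the average clustering coefficient (ACC) of each network, and reports the threshold where ACC peaks. The function names, the toy six-molecule matrix, the threshold grid and the use of `>=` are all assumptions made for this example, and the InfoMap clustering step is not shown.

```python
from itertools import combinations


def build_matrix(n, pairs, default=0.1):
    """Symmetric n x n similarity matrix from {(i, j): similarity} entries."""
    sim = [[default] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0
    for (i, j), s in pairs.items():
        sim[i][j] = sim[j][i] = s
    return sim


def threshold_graph(sim, t):
    """Adjacency sets: connect i and j when sim[i][j] >= t (assumed rule)."""
    n = len(sim)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if sim[i][j] >= t:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj:
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist among the neighbours of this node.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def acc_profile(sim, thresholds):
    """List of (threshold, ACC) pairs, one per threshold."""
    return [(t, average_clustering(threshold_graph(sim, t))) for t in thresholds]


def pick_threshold(profile):
    """Threshold with the highest ACC among the thresholds examined."""
    return max(profile, key=lambda p: p[1])[0]


# Hypothetical toy data: two tight groups {0,1,2} and {3,4,5} with weak cross links.
PAIRS = {
    (0, 1): 0.90, (0, 2): 0.80, (1, 2): 0.85,
    (3, 4): 0.90, (3, 5): 0.80, (4, 5): 0.85,
    (2, 3): 0.60, (0, 3): 0.40, (1, 4): 0.40, (2, 5): 0.40,
}
SIM = build_matrix(6, PAIRS)
THRESHOLDS = [0.30, 0.50, 0.70, 0.87]

profile = acc_profile(SIM, THRESHOLDS)
for t, acc in profile:
    print(f"threshold={t:.2f}  ACC={acc:.3f}")
print("selected threshold:", pick_threshold(profile))
```

On this toy matrix the ACC rises from 0.5 at threshold 0.30 to 1.0 at 0.70, where the two groups separate into triangles, and drops to 0 at 0.87, where only isolated pairs remain. Taking the global maximum is a simplification: on real data a very low threshold yields a nearly complete graph whose ACC is trivially close to 1, so the search would have to be restricted to the range after the steep decline in edge count that the abstract mentions.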
