MDL Drug Data Report Research Articles

Consensus clustering methods are motivated by the success of combining multiple classifiers in many areas. In this paper, graph-based consensus clustering is used to improve the quality of chemical compound clustering by enhancing the robustness, novelty, consistency and stability of individual clusterings. For this purpose, the Hyper-Graph Partitioning Algorithm (HGPA) [1] was applied. The clustering was evaluated on its ability to separate active from inactive molecules in each cluster, and the results were compared with Ward's clustering method.

The MDL Drug Data Report (MDDR) database, which consists of 102516 molecules, was used for the experiments. The dataset DS1 was chosen from the MDDR database; it has been used in many virtual screening experiments [2-4] and contains 10 heterogeneous activity classes (8568 molecules). Two 2D fingerprint descriptors, generated with SciTegic's Pipeline Pilot [5], were used for the clustering experiments: 120-bit ALOGP and 1024-bit extended connectivity fingerprints (ECFP_4).

The results were evaluated on how effectively the methods separate active from inactive molecules, using the quality partition index (QPI) measure devised by Varin et al. [6]. As defined by [7], an active cluster is a non-singleton cluster for which the percentage of active molecules in the cluster is greater than the percentage of active molecules in the dataset as a whole. Let p be the number of actives in active clusters, q the number of inactives in active clusters, r the number of actives in inactive clusters (i.e., clusters that are not active clusters) and s the number of singleton actives. A high QPI value occurs when the actives are clustered tightly together and separated from the inactive molecules.
The quality partition index, QPI, is then defined as:

$$\mathrm{QPI} = \frac{p}{p + q + r + s} \qquad (1)$$

The results were compared with Ward's method applied as an individual clustering, the standard clustering method for chemoinformatics applications. The ensemble was generated by multiple runs of the K-means algorithm, each with random initialization of the cluster centroids. The number of partitions generated in this step ranged from n = 5 to n = 50 in steps of 5. All the generated partitions were then combined using HGPA to obtain the consensus partition. This process was carried out for each fingerprint (ALOGP and ECFP_4). The QPI values were averaged over the ten activity classes of the dataset.

Tables 1 and 2 show the effectiveness of MDDR dataset clustering using the ALOGP and ECFP_4 fingerprints. The best QPI value of the consensus clustering methods in each column is shown in bold for ease of reference.

Table 1. Effectiveness of clustering of the highly diverse MDDR dataset: ALOGP fingerprint.
Table 2. Effectiveness of clustering of the highly diverse MDDR dataset: ECFP_4 fingerprint.

Visual inspection of the results enables comparisons between the effectiveness of consensus clustering of the MDDR dataset and that of Ward's method, the clustering method of choice for chemoinformatics applications. In addition, ten consensus clusterings were examined for each fingerprint in order to study the effectiveness of consensus clustering with different ensemble sizes. The results show that HGPA consensus clustering gives robust and novel results when the K-means algorithm is run 20-50 times using ALOGP, and that it outperforms Ward's method. For the dataset represented by the ECFP_4 fingerprint, the best QPI values of consensus clustering are obtained from ensembles of size n = 20-50. Here consensus clustering gives robust results that are better than the overall performance of the individual clusterings.
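The QPI definition above, read as p divided by (p + q + r + s), can be made concrete with a short sketch. The function name `qpi` and the toy labels below are hypothetical and are not taken from the paper; the code only follows the definitions of active clusters, inactive clusters and singletons given in the text.

```python
from collections import defaultdict


def qpi(cluster_ids, is_active):
    """Quality partition index, QPI = p / (p + q + r + s).

    cluster_ids: one cluster label per molecule.
    is_active:   one boolean per molecule (True = active).
    """
    if len(cluster_ids) != len(is_active):
        raise ValueError("cluster_ids and is_active must have the same length")
    n_active = sum(1 for a in is_active if a)
    if n_active == 0:
        raise ValueError("QPI is undefined when the dataset has no actives")
    dataset_fraction = n_active / len(is_active)

    # cluster label -> [number of actives, number of inactives]
    counts = defaultdict(lambda: [0, 0])
    for cid, active in zip(cluster_ids, is_active):
        counts[cid][0 if active else 1] += 1

    p = q = r = s = 0
    for actives, inactives in counts.values():
        size = actives + inactives
        if size == 1:
            s += actives              # singleton actives
        elif actives / size > dataset_fraction:
            p += actives              # actives in active clusters
            q += inactives            # inactives in active clusters
        else:
            r += actives              # actives in inactive clusters
    return p / (p + q + r + s)


# Toy example (hypothetical data): 8 molecules, 4 of them active.
example_clusters = [0, 0, 0, 1, 1, 1, 2, 3]
example_actives = [True, True, False, False, False, True, True, False]
print(qpi(example_clusters, example_actives))
```

In the toy example, cluster 0 is the only active cluster (p = 2, q = 1), cluster 1 is inactive (r = 1) and cluster 2 is a singleton active (s = 1), so QPI = 2/5.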
The QPI values of consensus clustering for both fingerprints are close to those of Ward's method. HGPA consensus clustering provides stable clusters by decreasing the sensitivity to noise and outliers. The average percentages of singleton clusters produced by the individual clusterings were compared with those of consensus clustering for both fingerprints. The results show that consensus clustering partitions the dataset with an average percentage of singletons equal to zero, which is much better than the individual clusterings and Ward's method. For example, 16.72% of the molecules of DS1 are clustered as singletons when Ward's method is applied to the ALOGP fingerprint with 1000 clusters.

We conclude that graph-based consensus clustering can improve the effectiveness of chemical compound clustering. With the ALOGP fingerprint, consensus clustering is more robust, novel, stable and consistent than Ward's method and outperforms it. With the ECFP_4 fingerprint, consensus clustering provides more robust, stable and consistent clustering, with results close to those of Ward's method. The experiments reported here suggest that graph-based consensus clustering can improve the quality of individual clusterings by using an efficient algorithm, K-means, to generate ensembles of size 20-50 for both structurally diverse chemical datasets.
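The combination step described in this abstract can be illustrated with a small sketch of the hypergraph that HGPA works on. In the usual formulation of HGPA, every cluster of every partition in the ensemble becomes one hyperedge over the molecules, and the consensus partition is sought by cutting as few hyperedges as possible under a balance constraint. The sketch below only builds the hyperedges and counts how many a candidate consensus partition cuts; it does not implement the hypergraph partitioner itself, and the function names and toy label vectors are hypothetical stand-ins for K-means outputs.

```python
def build_hyperedges(partitions):
    """Return one hyperedge (frozenset of object indices) per cluster per partition."""
    hyperedges = []
    for labels in partitions:
        members = {}
        for index, label in enumerate(labels):
            members.setdefault(label, set()).add(index)
        hyperedges.extend(frozenset(group) for group in members.values())
    return hyperedges


def cut_hyperedges(hyperedges, consensus_labels):
    """Count hyperedges whose members fall into more than one consensus cluster."""
    return sum(
        1
        for edge in hyperedges
        if len({consensus_labels[i] for i in edge}) > 1
    )


# Toy ensemble (hypothetical): three partitions of six objects.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 1, 1, 1, 1],
]
edges = build_hyperedges(ensemble)

# Candidate consensus partitions, scored by the number of hyperedges they cut.
candidate_a = [0, 0, 0, 1, 1, 1]
candidate_b = [0, 0, 1, 1, 1, 1]
print(len(edges), cut_hyperedges(edges, candidate_a), cut_hyperedges(edges, candidate_b))
```

A lower cut count means the candidate agrees with more of the clusters in the ensemble. A real HGPA run would pass the hyperedges to a dedicated hypergraph partitioner rather than scoring hand-picked candidates.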

Quantitative or qualitative characterization of the drug-like features of known drugs may help medicinal and computational chemists to select higher quality drug leads from a huge pool of compounds and to improve the efficiency of drug design pipelines. For this purpose, theoretical models of drug-likeness that discriminate between drug-like and non-drug-like molecules on the basis of molecular physicochemical properties and structural fingerprints were developed using naive Bayesian classification (NBC) and recursive partitioning (RP) techniques, and the drug-likeness of the compounds in the Traditional Chinese Medicine Compound Database (TCMCD) was then evaluated. First, the impact of molecular physicochemical properties and structural fingerprints on the prediction accuracy of drug-likeness was examined. We found that, compared with simple molecular properties, structural fingerprints were more important for the accurate prediction of drug-likeness. Then, a variety of Bayesian classifiers were constructed by changing the ratio of drug-like to non-drug-like molecules and the size of the training set. The results indicate that the prediction accuracy of the Bayesian classifiers was closely related to the size and the degree of balance of the training set. When a balanced training set was used, the best Bayesian classifier, based on 21 physicochemical properties and the LCFP_6 fingerprint set, yielded an overall leave-one-out (LOO) cross-validated accuracy of 91.4% for the 140,000 molecules in the training set and 90.9% for the 40,000 molecules in the test set. In addition, RP classifiers with different maximum depths were constructed and compared with the Bayesian classifiers, and we found that the best Bayesian classifier outperformed the best RP model with respect to overall prediction accuracy.
Moreover, the Bayesian classifier employing structural fingerprints highlights the important substructures that are favorable or unfavorable for drug-likeness, offering extra valuable information for obtaining high-quality lead compounds in the early stage of the drug design/discovery process. Finally, the best Bayesian classifier was used to predict the drug-likeness of 33,961 compounds in TCMCD. Our calculations show that 59.37% of the molecules in TCMCD were identified as drug-like, indicating that traditional Chinese medicines (TCMs) are an excellent source of drug-like molecules. Furthermore, the important structural fingerprints in TCMCD were detected and analyzed. Considering that the pharmacology of TCMCD and MDDR (MDL Drug Data Report) was linked by important common structural features, the potential pharmacology of the compounds in TCMCD may be annotated by these structural signatures identified from Bayesian analysis, which may be valuable in promoting the development of TCMs.
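The idea of a Bayesian classifier over structural fingerprints, and of reading off which features favour or disfavour drug-likeness, can be sketched as follows. This is a plain Bernoulli naive Bayes with add-one smoothing, which is an assumption and not necessarily the Bayesian variant used by the authors. Fingerprints are represented as sets of on-bit indices, and the class names, function names and toy data are all hypothetical.

```python
import math


def train_bernoulli_nb(fingerprints, labels, n_bits):
    """Fit per-class bit probabilities with add-one (Laplace) smoothing."""
    model = {}
    for cls in set(labels):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        bit_prob = [
            (sum(1 for fp in rows if bit in fp) + 1) / (len(rows) + 2)
            for bit in range(n_bits)
        ]
        model[cls] = (math.log(len(rows) / len(labels)), bit_prob)
    return model


def log_score(model, cls, fp):
    """Log prior plus the log-likelihood of every bit being on or off."""
    log_prior, bit_prob = model[cls]
    return log_prior + sum(
        math.log(prob if bit in fp else 1.0 - prob)
        for bit, prob in enumerate(bit_prob)
    )


def predict(model, fp):
    """Return the class with the highest log score."""
    return max(model, key=lambda cls: log_score(model, cls, fp))


def bit_weights(model, positive, negative):
    """Log-ratio of 'bit is set' probabilities; > 0 favours the positive class."""
    return [
        math.log(a / b)
        for a, b in zip(model[positive][1], model[negative][1])
    ]


# Toy training set (hypothetical): 4-bit fingerprints as sets of on-bit indices.
train_fps = [{0, 1}, {0, 2}, {0, 1, 2}, {3}, {2, 3}, {1, 3}]
train_labels = ["drug", "drug", "drug", "non", "non", "non"]
nb_model = train_bernoulli_nb(train_fps, train_labels, n_bits=4)

print(predict(nb_model, {0, 1}), predict(nb_model, {3}))
print(bit_weights(nb_model, "drug", "non"))
```

In this toy set, bit 0 receives a positive weight and bit 3 a negative one, which mirrors how per-feature weights can flag substructures as favorable or unfavorable.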

Articles published on MDL Drug Data Report

Convolutional Neural Network Model Based on 2D Fingerprint for Bioactivity Prediction.

Similarity-Based Virtual Screen Using Enhanced Siamese Multi-Layer Perceptron.

Improved Deep Learning Based Method for Molecular Similarity Searching Using Stack of Deep Belief Networks.

Stacked Ensemble for Bioactive Molecule Prediction

Ensemble learning method for the prediction of new bioactive molecules.

Quantum probability ranking principle for ligand-based virtual screening.

Structure-based virtual screening and characterization of a novel IL-6 antagonistic compound from synthetic compound database

Adapting Document Similarity Measures for Ligand-Based Virtual Screening.

A Quantum-Based Similarity Method in Virtual Screening.

Weighted voting-based consensus clustering for chemical structure databases.

Condorcet and borda count fusion method for ligand-based virtual screening.

Combining multiple clusterings of chemical structures using cluster-based similarity partitioning algorithm

Information Theory and Voting Based Consensus Clustering for Combining Multiple Clusterings of Chemical Structures.

Mining basic active structures from a large-scale database

Using graph-based consensus clustering for combining K-means clustering of heterogeneous chemical structures

3D Molecular Descriptors Important for Clinical Success

Voting-based consensus clustering for combining multiple clusterings of chemical structures.

Drug-likeness analysis of traditional Chinese medicines: 1. property distributions of drug-like compounds, non-drug-like compounds and natural compounds from traditional Chinese medicines

Drug-likeness Analysis of Traditional Chinese Medicines: Prediction of Drug-likeness Using Machine Learning Approaches

Detecting Drug Promiscuity Using Gaussian Ensemble Screening
