Anaplastic lymphoma kinase (ALK) plays a critical role in the development of various cancers. In this study, a dataset of 1810 collected inhibitors was divided into training and test sets by two methods: a self-organizing map (SOM) and random splitting. We developed 32 classification models using Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), and Extreme Gradient Boosting (XGBoost) to distinguish between highly and weakly active ALK inhibitors, with the inhibitors represented by MACCS and ECFP4 fingerprints. Model 7D, which was built with the RF algorithm on training set 1/test set 1 as divided by the SOM method, performed best, with a prediction accuracy of 90.97% and a Matthews correlation coefficient (MCC) of 0.79 on the test set. We clustered the 1810 inhibitors into 10 subsets with the K-Means algorithm to identify the structural characteristics of highly active ALK inhibitors. The main scaffolds of highly active ALK inhibitors were also analyzed based on ECFP4 fingerprints. Several substructures were found to contribute significantly to high activity, such as 2,4-diarylaminopyrimidine analogues, pyrrolo[2,1-f][1,2,4]triazine, indolo[2,3-b]quinoline-11-one, benzo[d]imidazole, and pyrrolo[2,3-b]pyridine. In addition, the subsets were summarized into several clusters, four of which showed a significant relationship with ALK inhibitory activity. Finally, Shapley additive explanations (SHAP) were used to explain the influence of the modeling features on the models' predictions. The SHAP results indicated that our models reflect the structural features of ALK inhibitors well.
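For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how prediction accuracy and MCC can be computed for a binary highly active (1) versus weakly active (0) classification. The function names and the toy labels are illustrative assumptions and are not taken from the study's data or code.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, 1 = highly active, 0 = weakly active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of compounds whose class was predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical toy labels for illustration only (not the study's test set).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the highly and weakly active classes are unevenly sized.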