Abstract

Machine learning tools are widely used in quantitative structure-activity/property relationship (QSAR/QSPR) modeling. Improved feature selection algorithms in ensemble methods have advanced QSAR/QSPR modeling by clarifying the relationship between features and target variables and by reducing computational requirements. Feature importance gives a clearer view of the relative contribution of each feature and helps interpret the predictions. However, the main difficulty with ensemble learning methods is that each model selects a different set of features, which complicates interpretation. It is therefore necessary to summarize each model and its corresponding features in order to achieve high prediction accuracy. In this article, we use a blending method for prediction and interpretation of experimentally measured fluorescence emission wavelengths. The blender has two levels. The first level uses multiple tree-based ensemble models: Random Forest, ExtraTrees, Adaptive Boosting, and Gradient Boosting. The second level uses a linear blending method that combines the information from the first-level models. Although ensemble learning models predict properties and activities accurately, the algorithms are often sensitive, so that even small changes can drastically affect their efficiency and accuracy. To overcome this difficulty, we repeat feature selection multiple times in each model to control this sensitivity. The resulting decision-tree-based (DT-based) QSAR/QSPR model accurately predicts the fluorescence data set as a regression task. This paper provides the best-optimized features for specific experimental chemical or biological values. Tables and figures reporting each model's feature selections and accuracy demonstrate the result: even though the number of features used to predict the fluorescence emission wavelength is reduced, the accuracy on the training and test sets is maintained and computational efficiency is increased.
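
The two-level blending scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, a synthetic data set in place of the fluorescence descriptors, and hypothetical settings (`n_repeats`, `top_k`, the split sizes, and the averaging of importances across seeds and models) chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for molecular descriptors and emission wavelengths (assumption).
X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

# Three-way split: base-model training, blender training, final test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_base, X_blend, y_base, y_blend = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)


def make_base_models(seed):
    """Return fresh first-level regressors (the four model families named in the text)."""
    return {
        "rf": RandomForestRegressor(n_estimators=50, random_state=seed),
        "et": ExtraTreesRegressor(n_estimators=50, random_state=seed),
        "ada": AdaBoostRegressor(n_estimators=50, random_state=seed),
        "gb": GradientBoostingRegressor(n_estimators=50, random_state=seed),
    }


# Repeat the importance ranking with different seeds and average the results,
# so that a single unstable run does not decide which features are kept.
n_repeats = 3   # hypothetical number of repetitions
top_k = 10      # hypothetical number of features to keep
importance = np.zeros(X.shape[1])
for seed in range(n_repeats):
    for model in make_base_models(seed).values():
        model.fit(X_base, y_base)
        importance += model.feature_importances_
importance /= n_repeats * 4  # four model families per repetition
selected = np.argsort(importance)[::-1][:top_k]

# Level 1: refit the base models on the reduced feature set.
base_models = make_base_models(0)
for model in base_models.values():
    model.fit(X_base[:, selected], y_base)


def level_one_predictions(X_part):
    """Stack each base model's predictions as one column per model."""
    return np.column_stack(
        [m.predict(X_part[:, selected]) for m in base_models.values()])


# Level 2: a linear blender trained on held-out level-1 predictions.
blender = LinearRegression()
blender.fit(level_one_predictions(X_blend), y_blend)

test_r2 = blender.score(level_one_predictions(X_test), y_test)
print(f"selected feature indices: {sorted(selected.tolist())}")
print(f"blended test R^2: {test_r2:.3f}")
```

The blender is fitted on a hold-out split that the base models never saw, which is what distinguishes blending from simply averaging in-sample predictions. The learned coefficients in `blender.coef_` indicate how much each first-level model contributes to the final prediction.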

