Abstract
Predicting biological activity is valuable for finding active compounds efficiently, and considerable attention has been devoted to in silico prediction in the drug discovery process. For in silico prediction, the quantitative structure-activity relationship (QSAR) approach is widely known to be useful [1, 2]. The basic purpose of QSAR is to construct a statistical model that reveals the relationship between chemical structures and their biological activities. For the statistical analysis, chemical structures are usually represented by several kinds of chemical descriptors. A QSAR model that has been successfully trained and scientifically validated is used to predict the biological activities of any molecule. In addition, a physicochemical and/or mechanistic interpretation can be expected from the chemical descriptors selected in the QSAR model. As a multivariate statistical method, partial least squares (PLS) is of particular interest in QSAR studies [3]. PLS can analyze data with strongly collinear, noisy, and numerous descriptors, and can also model several biological activities simultaneously. It also provides statistical measures such as application domains and diagnostic plots, which help extract the complex patterns embedded in a data set. Recently, PLS has evolved to cope with the severe demands of complex data structures [4, 5]. Its major restriction, however, is that only linear relationships can be extracted from the data [3]. Since many structure-activity data sets are inherently nonlinear, a flexible method that can model nonlinear relationships is desirable. Recently, there has been considerable interest in machine learning (ML) methods such as the Bayesian approach [6, 7] and support vector regression (SVR) [8, 9] for nonlinear modeling.
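To make concrete the point that PLS tolerates strongly collinear descriptors, the following is a minimal sketch of a one-component PLS1 fit in plain Python. It is not the procedure used in this work: the function names `fit_pls1_one_component` and `predict_pls1`, the toy descriptor matrix, and the restriction to a single latent component are all assumptions made for illustration.

```python
import math

def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model (illustrative sketch, hypothetical helper).

    X: list of rows, each a list of descriptor values.
    y: list of activity values, one per row.
    Returns (x_mean, y_mean, coef) for use with predict_pls1.
    """
    n, p = len(X), len(X[0])
    # Center descriptors and activities.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction in descriptor space of maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("descriptors have no covariance with the activity")
    w = [v / norm for v in w]
    # Latent scores t = Xc w, then regress y on the single score vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    # Regression coefficients expressed on the original descriptors.
    coef = [q * wj for wj in w]
    return x_mean, y_mean, coef

def predict_pls1(model, x):
    """Predict the activity of one compound from its descriptor vector."""
    x_mean, y_mean, coef = model
    return y_mean + sum((x[j] - x_mean[j]) * coef[j] for j in range(len(x)))

# Toy data (made up): the second descriptor is exactly twice the first,
# so ordinary least squares would face a singular normal-equation matrix.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [3.0, 6.0, 9.0, 12.0]
model = fit_pls1_one_component(X, y)
prediction = predict_pls1(model, [5.0, 10.0])
```

Because the model is built from a latent score rather than by inverting the descriptor covariance matrix, the perfectly collinear columns cause no numerical difficulty here. The fit remains linear in the descriptors, which is the restriction noted above.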
In general, because ML methods apply mathematical transformations to the chemical descriptors, they have the drawback that correlations between the biological activity and the original descriptors are lost. This means that direct interpretation of the model is not an easy task. Many papers on ML have reported high classification and regression performance, but unfortunately they have not addressed chemical interpretation [10]. For chemical interpretation, we employed the extended connectivity fingerprint (ECFP) as the chemical descriptor for a statistical model. ECFP makes it easier to understand which substructures are correlated with a specific biological activity. An atom score was calculated from the degree of contribution of each substructure to the model. The atom scores were then visualized with graded colors, producing an atom color map for each compound.
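The atom-score idea can be sketched in a few lines of Python. The abstract does not give the exact formula, so this sketch assumes that an atom's score is the sum of the model contributions of every substructure feature that covers that atom, and that scores are mapped linearly onto a blue-white-red scale. The names `atom_scores` and `score_to_rgb`, the feature identifiers, and the numbers are hypothetical.

```python
def atom_scores(feature_atoms, contributions, n_atoms):
    """Sum substructure contributions onto the atoms they cover.

    feature_atoms: dict mapping a feature id to the atom indices it covers
                   in one molecule (assumed to come from the fingerprint).
    contributions: dict mapping a feature id to its contribution to the model
                   (assumed here to be a signed weight).
    n_atoms:       number of atoms in the molecule.
    """
    scores = [0.0] * n_atoms
    for feature_id, atoms in feature_atoms.items():
        weight = contributions.get(feature_id, 0.0)  # unused features add nothing
        for atom_index in atoms:
            scores[atom_index] += weight
    return scores

def score_to_rgb(score, max_abs):
    """Map a score to a graded color: red for positive, blue for negative."""
    if max_abs == 0:
        return (255, 255, 255)  # no signal at all: plain white
    s = max(-1.0, min(1.0, score / max_abs))
    fade = int(round(255 * (1.0 - abs(s))))  # 255 near zero, 0 at the extremes
    return (255, fade, fade) if s > 0 else (fade, fade, 255)

# Toy molecule with four atoms and three substructure features (made up).
feature_atoms = {"f1": [0, 1], "f2": [1, 2], "f3": [3]}
contributions = {"f1": 0.5, "f2": -0.25, "f3": -0.5}

scores = atom_scores(feature_atoms, contributions, n_atoms=4)
max_abs = max(abs(s) for s in scores)
colors = [score_to_rgb(s, max_abs) for s in scores]
```

In this toy case atom 1 is covered by both a positive and a negative feature, so its score is their sum, while atoms 0 and 3 receive the strongest red and blue respectively. Other aggregation rules, such as averaging over the covering features, would be equally plausible readings of the description.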