Abstract
As it becomes easier to measure and calculate many descriptors per molecule in quantitative structure–activity relationship (QSAR) studies, variable selection is gaining importance as a means of reducing data and improving interpretability. Variable selection has been studied extensively in the context of supervised learning; in this paper, an unsupervised method for variable selection is proposed and its performance is assessed on a typical QSAR data set. Because the proposed algorithm uses no real dependent variable, the selection is genuinely unsupervised. Instead, scores formed as linear combinations of the data variables serve as artificial dependent variables. The data set comprises 107 derivatives of the HEPT molecule, characterized by 160 descriptors encoding the steric, hydrophobic, electronic and structural features of the derivatives. The procedure aims to generate a subset of relevant descriptors from the data set, eliminate redundancy, and reduce multicollinearity. The core of the methodology is jack-knife resampling. Here, jack-knife resampling selected 48 of the 160 initial descriptors while preserving the information in the data. Finally, considering each descriptor's influence on prediction yielded eight descriptors that represent the original 160. The model constructed with these final 8 descriptors has Q2IN = 0.67, R2 = 0.74 and Q2EXT = 0.85, which indicates that the strategy adequately preserves the data structure.
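The idea of jack-knife screening against an artificial dependent variable can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes that the score is the first principal component of the autoscaled descriptor matrix, and that a descriptor is kept when its loading is large relative to its leave-one-out (jack-knife) standard error. The function name `jackknife_loading_selection` and the threshold `t_crit` are hypothetical.

```python
import numpy as np

def jackknife_loading_selection(X, t_crit=2.0):
    """Illustrative only: keep variables whose first-PC loading is stable
    under leave-one-out resampling. Assumes no constant columns in X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Autoscale each descriptor (zero mean, unit variance).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    def first_loading(A):
        # Loading vector of the first principal component via SVD.
        A = A - A.mean(axis=0)
        _, _, vt = np.linalg.svd(A, full_matrices=False)
        return vt[0]

    ref = first_loading(Xs)          # loadings from the full data set
    reps = np.empty((n, p))
    for i in range(n):
        v = first_loading(np.delete(Xs, i, axis=0))  # leave sample i out
        if v @ ref < 0:              # resolve the sign ambiguity of the SVD
            v = -v
        reps[i] = v

    # Jack-knife standard error of each loading.
    mean = reps.mean(axis=0)
    se = np.sqrt((n - 1) / n * ((reps - mean) ** 2).sum(axis=0))
    t_stat = np.abs(ref) / np.maximum(se, 1e-12)
    return np.flatnonzero(t_stat > t_crit), t_stat
```

Called on a samples-by-descriptors array, the function returns the indices of the retained descriptors and the per-descriptor stability statistic. A fuller treatment would presumably use several score vectors and a further redundancy filter, as the two-stage reduction from 160 to 48 to 8 descriptors suggests.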