Abstract

Difficult benchmark problems are in increasing demand in Genetic Programming (GP). One problem seeing increased usage is the oral bioavailability problem, which is often presented as challenging for both GP and other machine learning methods. However, few properties of the bioavailability data set have been demonstrated, so the attributes that make it challenging are largely unknown. This work uncovers important properties of the bioavailability data set and suggests that the perceived difficulty of the problem can be partially attributed to a lack of pre-processing: the data set includes features that contain no information, as well as contradictory relationships between its dependent and independent features. The paper then re-examines the performance of GP on this data set and contextualises this performance relative to other regression methods. Results suggest that a large component of the observed performance differences on the bioavailability data set can be attributed to variance in the selection of training and testing data. Differences in performance between GP and other methods disappear when multiple training/testing splits are used within experimental work, with performance typically no better than a null modelling approach of reporting the mean of the training data.
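
The checks named in the abstract (features that carry no information, contradictory input/target pairs, and a null model evaluated over multiple training/testing splits) can be illustrated with a minimal sketch in plain Python. This is not the paper's code: the function names, the toy data, the 70/30 split, the number of splits, and the use of RMSE as the error measure are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def constant_columns(X):
    """Return indices of features that take a single value in every row."""
    return [j for j in range(len(X[0])) if len({row[j] for row in X}) == 1]


def contradictory_groups(X, y):
    """Group row indices that share identical inputs but have differing targets."""
    rows_by_input = defaultdict(list)
    for i, row in enumerate(X):
        rows_by_input[tuple(row)].append(i)
    return [idx for idx in rows_by_input.values()
            if len({y[i] for i in idx}) > 1]


def null_model_rmse(y, n_splits=30, test_fraction=0.3, seed=0):
    """Test RMSE of predicting the training mean, one value per random split."""
    rng = random.Random(seed)
    n_test = max(1, int(round(test_fraction * len(y))))
    scores = []
    for _ in range(n_splits):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        mean = sum(y[i] for i in train) / len(train)
        mse = sum((y[i] - mean) ** 2 for i in test) / len(test)
        scores.append(math.sqrt(mse))
    return scores


# Hypothetical toy data: column 1 is constant, and rows 0 and 1
# have identical inputs but different targets.
X = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.0], [3.0, 0.0, 1.0],
     [4.0, 0.0, 5.0], [2.0, 0.0, 7.0]]
y = [10.0, 80.0, 35.0, 50.0, 20.0]

print("constant features:", constant_columns(X))
print("contradictory rows:", contradictory_groups(X, y))
scores = null_model_rmse(y, n_splits=5)
print("null-model RMSE range over splits:", min(scores), max(scores))
```

Under these assumptions, the spread of the null-model scores gives a rough sense of how much error varies with the choice of split alone. A candidate regression method would be scored on the same splits and compared against that spread rather than against a single split.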
