Human oral bioavailability is a crucial factor in drug discovery, and researchers have constructed a variety of prediction models for it in recent years. However, human oral bioavailability data sets are small, so making accurate predictions from few samples has become a critical issue in the field. The deep forest model, whose number of cascade levels is determined adaptively, can perform exceptionally well even on small-scale data. The original deep forest, however, suffers from an unbalanced multi-grained scanning process and from premature stopping of cascade forest training. In this paper, we propose a human oral bioavailability prediction method based on an improved deep forest, called balanced multi-grained scanning mapping cascade forest (bgmc-forest). First, Mordred descriptors are used for feature extraction. Enhanced features are then obtained by the improved balanced multi-grained scanning, which solves the problem of missing features at both ends. Finally, the prediction results are obtained by the feature mapping cascade forest, which combines principal component analysis with cascade forests to ensure the effectiveness of the cascade forest. Comparative experiments demonstrate the superiority of the proposed model, and ablation experiments verify the effectiveness of the improved modules. The decision-making process of the model is then explained with the Shapley additive explanations (SHAP) interpretation algorithm.
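The imbalance in multi-grained scanning can be illustrated with a small sketch. With a plain sliding window, features at both ends of the vector fall into fewer windows than features in the middle. The abstract does not say how the balanced variant corrects this; the wrap-around (circular) window below is only one possible scheme, assumed here for illustration, and the function names `sliding_windows`, `circular_windows`, and `coverage` are hypothetical.

```python
from typing import List


def sliding_windows(features: List[float], window: int) -> List[List[float]]:
    """Plain scanning: only windows that fit entirely inside the vector."""
    return [features[i:i + window] for i in range(len(features) - window + 1)]


def circular_windows(features: List[float], window: int) -> List[List[float]]:
    """Assumed balancing scheme (not taken from the paper): wrap around the
    end of the vector so that every position starts exactly one window."""
    n = len(features)
    return [[features[(i + j) % n] for j in range(window)] for i in range(n)]


def coverage(n: int, window: int, circular: bool) -> List[int]:
    """Count how many windows contain each of the n feature positions."""
    counts = [0] * n
    starts = range(n) if circular else range(n - window + 1)
    for start in starts:
        for offset in range(window):
            # The modulo only has an effect for wrap-around windows.
            counts[(start + offset) % n] += 1
    return counts


if __name__ == "__main__":
    print(coverage(6, 3, circular=False))  # end positions are under-covered
    print(coverage(6, 3, circular=True))   # every position is covered equally
```

In the plain case the two end positions each appear in a single window while the middle positions appear in three; under the assumed wrap-around scheme every position appears in the same number of windows.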