Accurate prediction of parameters related to the environmental exposure of chemicals is crucial for the sound management of chemicals. However, the lack of large data sets for training models may result in poor prediction accuracy and robustness. Herein, integrated transfer learning (TL) and multitask learning (MTL) was proposed for constructing a graph neural network (GNN) model (abbreviated as TL-MTL-GNN model) using n-octanol/water partition coefficients as a source domain. The TL-MTL-GNN model was trained to predict three bioaccumulation parameters based on enlarged data sets that cover 2496 compounds with at least one bioaccumulation parameter. Results show that the TL-MTL-GNN model outperformed single-task GNN models with and without the TL, as well as conventional machine learning models trained with molecular descriptors or fingerprints. Applicability domains were characterized by a state-of-the-art structure-activity landscape-based (abbreviated as ADSAL) methodology. The TL-MTL-GNN model coupled with the optimal ADSAL was employed to predict bioaccumulation parameters for around 60,000 chemicals, with more than 13,000 compounds identified as bioaccumulative chemicals. The high predictive accuracy and robustness of the TL-MTL-GNN model demonstrate the feasibility of integrating the TL and MTL strategy in modeling small-sized data sets. The strategy holds significant potential for addressing small data challenges in modeling environmental chemicals.
Read full abstract