Molecular image-convolutional neural network (CNN) assisted QSAR models for predicting contaminant reactivity toward OH radicals: Transfer learning, data augmentation and model interpretation

Shifa Zhong, Jiajie Hu, Xiong Yu, Huichun Zhang

doi:10.1016/j.cej.2020.127998

Abstract

In this study, we used molecular images as a representation of organic compounds and combined them with a convolutional neural network (CNN) to develop quantitative structure-activity relationships (QSARs) for predicting the rate constants of compounds reacting with OH radicals. We applied transfer learning and data augmentation to train the molecular image-CNN models and the Gradient-weighted Class Activation Mapping (Grad-CAM) method to interpret them. Results showed that data augmentation and transfer learning effectively enhanced the robustness and predictive performance of the models: the root-mean-square error (RMSE) on the test dataset (RMSEtest) decreased from 0.395–0.45 to 0.284–0.339 after applying data augmentation, and the RMSE on the training dataset (RMSEtrain) decreased from 0.452–0.592 to 0.123–0.151 after applying transfer learning. The resulting molecular image-CNN models showed predictive performance (RMSEtest 0.284–0.339) comparable to that of the molecular fingerprint-based models (RMSEtest 0.30–0.35). Grad-CAM interpretation showed that the molecular image-CNN models correctly selected the molecular features in the images and identified key functional groups that influenced the reactivity. The applicability domain analysis showed that the molecular image-CNN models have a broader applicability domain than the molecular fingerprint-based models. The reactivity of any new compound with a maximum similarity of over 0.85 to the compounds in the training dataset can be reliably predicted. This study demonstrated that molecular image-CNN is a new tool for developing QSARs for environmental applications and can be used to build trustworthy models that make meaningful predictions.
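
The applicability-domain criterion stated above (maximum similarity to the training compounds of over 0.85) can be illustrated with a minimal sketch. The abstract does not say which similarity metric or compound representation was used, so this example assumes Tanimoto similarity computed on sets of "on" fingerprint bits. The function names and the toy bit sets are hypothetical and are not taken from the study.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(a | b)


def max_similarity(query: Set[int], training: Iterable[Set[int]]) -> float:
    """Highest similarity between the query and any training compound."""
    return max((tanimoto(query, t) for t in training), default=0.0)


def in_applicability_domain(
    query: Set[int], training: Iterable[Set[int]], threshold: float = 0.85
) -> bool:
    """True if the query's maximum similarity to the training set exceeds the threshold."""
    return max_similarity(query, training) > threshold


# Toy data (hypothetical bit indices, not real fingerprints).
training = [{1, 2, 3, 4, 5, 6, 7}, {10, 11, 12}]
query_close = {1, 2, 3, 4, 5, 6, 7, 8}  # shares 7 of 8 bits with the first compound
query_far = {1, 2, 20, 21}              # shares few bits with any training compound

print(in_applicability_domain(query_close, training))
print(in_applicability_domain(query_far, training))
```

In this sketch a compound falls inside the domain only if at least one training compound is strictly more similar than the threshold, which mirrors the "over 0.85" wording. A real workflow would compute fingerprints with a cheminformatics toolkit rather than hand-written bit sets.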

Full Text