Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network

Jiarui Chen, Yain-Whar Si, Shirley W I Siu, Chon-Wai Un

doi:10.1186/s13321-021-00570-8

Abstract

As safety is one of the most important properties of drugs, chemical toxicology prediction has received increasing attention in drug discovery research. Traditionally, researchers rely on in vitro and in vivo experiments to test the toxicity of chemical compounds. However, these experiments are time consuming and costly, and those that involve animal testing are increasingly subject to ethical concerns. While traditional machine learning (ML) methods have been used in the field with some success, the limited availability of annotated toxicity data is the major hurdle to further improving model performance. Inspired by the success of semi-supervised learning (SSL) algorithms, we propose a Graph Convolution Neural Network (GCN) to predict chemical toxicity and train the network with the Mean Teacher (MT) SSL algorithm. Using the Tox21 data, our optimal SSL-GCN models for predicting the twelve toxicological endpoints achieve an average ROC-AUC score of 0.757 on the test set, which is a 6% improvement over GCN models trained by supervised learning and conventional ML methods. Our SSL-GCN models also exhibit superior performance when compared to models constructed using the built-in DeepChem ML methods. These results show that SSL can increase the prediction power of models by learning from unannotated data. The optimal ratio of unannotated to annotated data ranges between 1:1 and 4:1. Beyond chemical toxicity prediction, the same technique is expected to benefit other chemical property prediction tasks by utilizing existing large chemical databases. Our optimal model SSL-GCN is hosted on an online server accessible through: https://app.cbbio.online/ssl-gcn/home.

Highlights

The fundamental strategy in modern drug discovery and development is to identify chemical compounds that potently and selectively modulate the functions of the target molecules to elicit a desired biological response.
In certain cases (KNN, Support Vector Machine (SVM), and XGBoost), we observed that the same optimal models were obtained in all replicate experiments, so the ROC-AUC scores were identical.
Case study: how does the similarity between unlabeled and labeled data affect the semi-supervised learning process? In the previous section, we showed that semi-supervised learning algorithms can improve the performance of our graph convolutional neural network (GCN) models compared to models trained with a purely supervised algorithm.

Summary

Introduction

The fundamental strategy in modern drug discovery and development is to identify chemical compounds that potently and selectively modulate the functions of the target molecules to elicit a desired biological response. The Weave model, proposed by Kearnes et al. in 2016 [14], is a deep learning system based on molecular graph convolutions. Its learnable component, the Weave module, extracts and combines atom features and distance relationships. These modules can be stacked to an arbitrary depth, allowing the architecture to be tuned to the needs of different learning tasks. In 2020, Wang et al. proposed a graph attention convolutional neural network (GACNN), a Graph Convolution Neural Network with undirected graphs and an attention mechanism, to classify chemicals that are poisonous to honey bees [16]. They demonstrated that their GACNN model performed better than all previous models, and they summarised important structural features that might lead to poisoning.
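
To make the idea of a molecular graph convolution concrete, the following is a minimal sketch of a generic mean-aggregation graph convolution step followed by a sum readout. It is not the Weave module, the GACNN, or the GCN architecture used in this work; the function names, the one-hot atom features, and the identity weight matrix are all illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two dense matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adjacency: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One mean-aggregation graph convolution step with a ReLU.

    Each atom averages its own features with those of its bonded
    neighbours, then a shared linear map and a ReLU are applied.
    (Hypothetical helper, not taken from the paper.)
    """
    n = len(adjacency)
    # Add self-loops so every atom keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Row-normalise by the atom's degree (self-loop included).
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


def readout(node_features: Matrix) -> List[float]:
    """Sum-pool atom features into one molecule-level vector."""
    return [sum(col) for col in zip(*node_features)]


# Toy molecule: the heavy atoms of ethanol as a C-C-O chain.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
features = [[1.0, 0.0],   # carbon   (assumed one-hot: [is_C, is_O])
            [1.0, 0.0],   # carbon
            [0.0, 1.0]]   # oxygen
weights = [[1.0, 0.0],    # identity map, chosen only for readability
           [0.0, 1.0]]

hidden = graph_conv(adjacency, features, weights)
molecule_vector = readout(hidden)
```

After one step, the middle carbon already carries information about the neighbouring oxygen; stacking several such steps widens the neighbourhood each atom can see, which is the sense in which graph convolution modules are stacked to a chosen depth.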

Methods
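
The Mean Teacher algorithm named in the abstract trains a student model with a supervised loss on annotated data plus a consistency loss that pulls the student's predictions toward those of a teacher model, whose weights are an exponential moving average (EMA) of the student's weights. The sketch below illustrates only these two ingredients on plain lists of numbers. The decay `alpha`, the consistency weight, and all function names are assumptions made for illustration and are not the settings or code used in this study.

```python
import math
from typing import List


def ema_update(teacher: List[float], student: List[float],
               alpha: float = 0.99) -> List[float]:
    """Move the teacher weights toward the student by an EMA step."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs: List[float],
                     teacher_probs: List[float]) -> float:
    """Mean squared difference between student and teacher predictions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


def binary_cross_entropy(probs: List[float], labels: List[int]) -> float:
    """Supervised loss on annotated compounds (toxic = 1, non-toxic = 0)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(labeled_probs: List[float], labels: List[int],
               student_probs: List[float], teacher_probs: List[float],
               consistency_weight: float = 1.0) -> float:
    """Supervised loss plus weighted consistency loss.

    The consistency term needs no labels, so it can be evaluated on
    unannotated compounds as well as annotated ones.
    """
    supervised = binary_cross_entropy(labeled_probs, labels)
    consistency = consistency_loss(student_probs, teacher_probs)
    return supervised + consistency_weight * consistency


# One illustrative step with made-up numbers.
teacher_weights = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)
loss = total_loss(labeled_probs=[0.7, 0.2], labels=[1, 0],
                  student_probs=[0.8, 0.2], teacher_probs=[0.6, 0.4])
```

In a full training loop, the student would be updated by gradient descent on `total_loss`, and `ema_update` would be called after each student update; only the student receives gradients.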

Results

Conclusion