Identifying Chirality in Line Drawings of Molecules Using Imbalanced Dataset Sampler for a Multilabel Classification Task.

Yong En Kok, Simon Woodward, Ender Özcan, Mercedes Torres Torres

doi:10.1002/minf.202200068

Open Access

https://doi.org/10.1002/minf.202200068

Journal: Molecular Informatics
Publication Date: Jun 30, 2022
License type: other-oa

Affiliation: University of Nottingham

Abstract

Chirality, the ability of some molecules to exist as two non-superimposable mirror images, profoundly influences both chemistry and biology. Advances in deep learning enable the automatic recognition of chemical structure diagrams; however, studies on identifying molecular chirality are scarce, and machine-readable molecular representations are not always sufficient to fully encode this important property. Here, we pretrained networks on a ChEMBL+ dataset (79641 molecules) and fine-tuned them for the binary classification of chirality (achiral/chiral) or the multilabel classification of chirality type (none/centre/axial/planar). To address the imbalance among label combinations in the multilabel task, we propose a Formulated Imbalanced Dataset Sampler (FIDS), which samples a formulated amount of minority label combinations on top of the training set. In a 10-fold cross-validation experiment on our CHIRAL dataset (1142 manually curated molecules), our models achieved an accuracy of up to 90 % in the binary task. In the multilabel task with FIDS, the overall performance increased from 87 % to 89 %, and the accuracy per label combination improved by up to 50 %. Through the study of heatmaps, our work also illustrates the potential of deep neural networks to make predictions based on the actual location of chirality elements.
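
As a rough illustration of what sampling extra minority label combinations on top of a training set can look like, the Python sketch below groups samples by their exact label combination and appends random duplicates of the rarer combinations. The abstract does not state the FIDS formula, so the rule used here (top up each combination to a fixed fraction of the largest one) is an assumption made for illustration and should not be read as the authors' method. The function name `oversample_minority_combinations` and the parameter `target_fraction` are likewise hypothetical.

```python
import random
from collections import Counter


def oversample_minority_combinations(samples, target_fraction=0.5, seed=0):
    """Append duplicates of rare label combinations to a training set.

    samples: list of (item, labels) pairs, where labels is a frozenset
             such as frozenset({"centre", "axial"}).
    target_fraction: ASSUMED rule, not taken from the paper. Each label
             combination is topped up to this fraction of the size of
             the most frequent combination.
    """
    rng = random.Random(seed)

    # Count how often each exact label combination occurs.
    counts = Counter(labels for _, labels in samples)
    target = max(1, round(target_fraction * max(counts.values())))

    # Group the samples by label combination.
    groups = {}
    for item, labels in samples:
        groups.setdefault(labels, []).append((item, labels))

    # Draw extra samples (with replacement) for under-represented groups.
    extra = []
    for labels, group in groups.items():
        deficit = target - len(group)
        if deficit > 0:
            extra.extend(rng.choices(group, k=deficit))

    # The original training set is kept intact; extras are added on top.
    return samples + extra
```

With ten `centre` molecules, two `centre`+`axial` molecules, and one `planar` molecule, a `target_fraction` of 0.5 would leave the majority group untouched and raise each of the two rare combinations to five samples.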

Full Text