Abstract

In this study, an autoencoder-based molecular structure embedding model was developed to predict the treatability of micropollutants in drinking water treatment plants (DWTPs) by machine learning, using three years of monitoring data for 69 micropollutants at 18 DWTPs. The molecular structure, which encodes physicochemical characteristics, was embedded as a fixed-length vector, a form well suited to data-driven analysis and machine learning. First, the molecular structure of each micropollutant was converted to a sequence of tokens using the simplified molecular-input line-entry system (SMILES) pair encoding tokenizer, a frequency-based tokenization method. The token sequence was then compressed into a fixed-length vector using an autoencoder trained on a wide range of molecular structures from the Chemical Entities of Biological Interest (ChEBI) database. To validate the proposed model, a binary classification of micropollutant treatability was performed by machine learning, using the embedded molecular structure together with external features such as concentration, season, and the presence of specific drinking water treatment processes. The accuracy of the developed model for the 69 micropollutants in this study was 0.86, and the molecular structure was determined to be the most important feature. Furthermore, an accuracy of 0.71 was obtained in external validation on pharmaceuticals and personal care products that were not used for training. This shows that the proposed embedding vector generalizes to molecules not seen during training, which indicates that it reflects the characteristics of the molecular structures.
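The frequency-based tokenization step can be illustrated with a minimal sketch. It is not the tokenizer used in the study: it assumes a generic pair-merging scheme that starts from single characters and repeatedly merges the most frequent adjacent pair, and the function names (`learn_merges`, `apply_merge`, `tokenize`) and the toy SMILES strings are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(smiles_list, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Assumption: start from single characters for simplicity.
    corpus = [list(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Break count ties by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(smiles, merges):
    """Tokenize a SMILES string by replaying the learned merges in order."""
    tokens = list(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (hypothetical): ethanol, ethylamine, propane.
merges = learn_merges(["CCO", "CCN", "CCC"], num_merges=1)
tokens = tokenize("CCO", merges)
```

Frequent substructures end up as single tokens, so common fragments give shorter sequences. The resulting token sequences still vary in length, which is why a separate step such as the autoencoder is needed to obtain fixed-length vectors.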
