On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2—Applicability Domain and Outliers

Cindy Trinh, Silvia Lasala, Dimitrios Meimaroglou, Olivier Herbinet

doi:10.3390/a16120573

Open Access

https://doi.org/10.3390/a16120573

Abstract

This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data to predict the ideal gas enthalpy of formation and entropy of molecules from descriptors. The AD is crucial because it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used instead of the distance metrics of classical low-dimensional AD definition methods, which are generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules, and the effect of their removal is investigated. A more significant improvement in model performance is observed when outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the mean absolute error (MAE) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, eliminating X-outliers and then Model-outliers divides the MAE and the root mean square error (RMSE) by 2 and 3, respectively, while reducing overfitting. Eliminating y-outliers has no significant effect on model performance. During model construction and deployment, the AD serves to verify the position of the test data, and of different categories of molecules, with respect to the training data and to associate this position with their prediction accuracy. For data that RF confidence places close to the training data but that still display high prediction errors, tSNE 2D representations are used to identify possible sources of these errors (e.g., the representation of the chemical information in the training data).
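
The thresholding rule described in the abstract (a molecule lies within the AD if its anomaly score is below a given threshold) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that 2D coordinates are already available (in the article they come from a tSNE projection; here they are invented toy points), that the anomaly score is the mean Euclidean distance to the k nearest training points, and that the threshold is set so that a chosen fraction of the training points is kept. The function names `knn_scores` and `score_threshold`, the value k = 2 and the keep fraction 0.8 are all hypothetical choices made for illustration.

```python
import math


def knn_scores(points, reference, k, leave_one_out=False):
    """Mean Euclidean distance from each point to its k nearest reference points.

    With leave_one_out=True the smallest distance is skipped, which is the
    zero distance of a training point to itself (assumed scoring rule).
    """
    start = 1 if leave_one_out else 0
    scores = []
    for p in points:
        dists = sorted(math.dist(p, r) for r in reference)
        nearest = dists[start:start + k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def score_threshold(train_scores, keep_fraction):
    """Smallest training score such that keep_fraction of the scores are <= it."""
    ordered = sorted(train_scores)
    idx = max(0, math.ceil(keep_fraction * len(ordered)) - 1)
    return ordered[idx]


# Invented 2D coordinates standing in for a tSNE projection of descriptors.
train_2d = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
new_2d = [(0.5, 0.4), (8.0, 8.0)]

# Preprocessing step: score each training point against the other training points.
train_scores = knn_scores(train_2d, train_2d, k=2, leave_one_out=True)
threshold = score_threshold(train_scores, keep_fraction=0.8)
train_in_ad = [s <= threshold for s in train_scores]

# Deployment step: score new points against the training set, same threshold.
new_scores = knn_scores(new_2d, train_2d, k=2)
new_in_ad = [s <= threshold for s in new_scores]

print("training points inside AD:", train_in_ad)
print("new points inside AD:", new_in_ad)
```

In this toy setup the isolated training point (10, 10) is flagged as an X-outlier, and the new point (8, 8) falls outside the AD while (0.5, 0.4) falls inside. The same thresholding pattern applies to any anomaly score, including those produced by an isolation forest or by the spread of random forest tree predictions; only the scoring function changes.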

Journal: Algorithms
Publication Date: Dec 18, 2023
License type: CC BY 4.0
