Abstract

The goal of chemmodlab is to streamline the fitting and assessment pipeline for many machine learning models in R, making it easy for researchers to compare the utility of these models. Although it focuses on model fitting and assessment methods that have been accepted by experts in the cheminformatics field, all of the methods in chemmodlab have broad utility for the machine learning community. chemmodlab contains several assessment utilities, including a plotting function that constructs accumulation curves and a function that computes many performance measures. The most novel feature of chemmodlab is the ease with which statistically significant performance differences among many machine learning models are presented by means of the multiple comparisons similarity plot. Differences are assessed using repeated k-fold cross-validation, where blocking increases precision and multiplicity adjustments are applied. chemmodlab is freely available on CRAN at https://cran.r-project.org/web/packages/chemmodlab/index.html.
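To make the accumulation curve mentioned above concrete, here is a minimal sketch in Python rather than R. It assumes the common definition: compounds are ranked by predicted score, and the curve records how many true actives have been found after selecting the top 1, 2, ..., n compounds. The function name `accumulation_curve` and the toy data are hypothetical and are not part of chemmodlab.

```python
def accumulation_curve(scores, labels):
    """Cumulative number of actives found among the top-ranked compounds.

    scores: predicted scores, higher means more likely active (assumption).
    labels: 1 for an active compound, 0 for an inactive one.
    Ties keep their input order because Python's sort is stable.
    """
    # Rank compound indices from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    curve = []
    for i in order:
        hits += labels[i]   # count an active if this compound is one
        curve.append(hits)  # actives found after selecting this many compounds
    return curve


# Toy data, invented purely for illustration.
scores = [0.9, 0.2, 0.75, 0.4, 0.6]
labels = [1, 0, 1, 0, 0]
curve = accumulation_curve(scores, labels)
print(curve)
```

A curve that rises quickly and then flattens indicates that the model places most actives near the top of its ranking.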

Highlights

  • It is commonplace for researchers across a variety of fields to fit machine learning models on complex data to make predictions

  • R implements a myriad of modeling methods that may be worthwhile for researchers to try

  • Functions for computing molecular descriptors and applicability domain have been added; chemmodlab is organized into two successive components: (1) model fitting, which is primarily conducted via function ModelTrain, and (2) model assessment, which is conducted via function CombineSplits

Summary

Introduction

It is commonplace for researchers across a variety of fields to fit machine learning models on complex data to make predictions. The Pharmacophore-Least Angle Regression (LAR) combination (AUC: 0.71) involves a highly interpretable linear model with a subset of the Pharmacophore descriptors selected. This 0.05 difference is small, and without additional investigation it is unclear whether it is statistically significant. By performing multiple cross-validation splits and using these splits as a blocking factor to improve precision, chemmodlab is able to test for statistical significance of performance measure differences and visualize these results in a manner that can be interpreted by the user. The question this addresses is: if the experiment were repeated with changes to the training and/or test set, would the best performing model still be the best? Functions for computing molecular descriptors and applicability domain have been added; chemmodlab is organized into two successive components: (1) model fitting, which is primarily conducted via function ModelTrain, and (2) model assessment, which is conducted via function CombineSplits.
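The precision gain from using splits as a blocking factor can be illustrated with a small simulation. This is a sketch in Python, not chemmodlab's implementation. The scores are simulated, and the means, the split-to-split noise level, and all function names are assumptions chosen for illustration. Each split shifts both models' scores by the same amount, so comparing the models within each split cancels that shared variation.

```python
import math
import random
import statistics


def simulate_split_scores(n_splits=20, seed=0):
    """Simulate one score per model per cross-validation split (made-up values)."""
    rng = random.Random(seed)
    model_a, model_b = [], []
    for _ in range(n_splits):
        # Variation shared by both models within one split (the block effect).
        split_effect = rng.gauss(0.0, 0.05)
        model_a.append(0.80 + split_effect + rng.gauss(0.0, 0.01))
        model_b.append(0.75 + split_effect + rng.gauss(0.0, 0.01))
    return model_a, model_b


def unblocked_se(a, b):
    """Standard error of the mean difference, ignoring which split each score came from."""
    return math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))


def blocked_se(a, b):
    """Standard error of the mean difference using within-split (paired) differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))


scores_a, scores_b = simulate_split_scores()
mean_diff = statistics.mean(scores_a) - statistics.mean(scores_b)
se_unblocked = unblocked_se(scores_a, scores_b)
se_blocked = blocked_se(scores_a, scores_b)
print(f"mean difference: {mean_diff:.3f}")
print(f"SE ignoring splits: {se_unblocked:.4f}")
print(f"SE blocking on splits: {se_blocked:.4f}")
```

Because the blocked standard error is smaller, the same observed difference yields a larger test statistic, which is why a small gap between two models can become detectable once splits are treated as blocks. When many models are compared, the resulting pairwise tests would still need a multiplicity adjustment, which this sketch omits.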

Results and discussion

[Table and figure labels recovered from the original: CID, Outcome; Number of compounds selected; Error Rate]
