Stand-alone Version Research Articles

The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, a common concern is the quality of both the chemical structure information and the associated experimental data. This is especially true when those data are collected from multiple sources, as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two- and/or three-dimensional “QSAR-ready” forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read. Second, the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative QSAR modeling projects and to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the “QSAR-ready” workflow to generate “MS-ready structures” to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry.
Both the QSAR-ready and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as Docker container resources for the scientific community. Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it improves the accuracy and reliability of molecular descriptors. The freely available resources in KNIME, GitHub, and Docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
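
To illustrate why duplicate removal comes last in such a pipeline, here is a toy sketch of three of the listed operations (desalting, stereochemistry stripping, duplicate removal) applied directly to SMILES strings. It is not the published KNIME workflow. The function names are hypothetical, fragment size is approximated by string length, and duplicates are detected by plain string equality rather than by canonicalization. A real implementation would rely on a cheminformatics toolkit and would also cover tautomers, nitro groups, valence correction, and neutralization.

```python
def largest_fragment(smiles):
    """Toy desalting: keep the longest dot-separated SMILES component.

    Assumption: string length stands in for fragment size.
    """
    return max(smiles.split("."), key=len)


def strip_stereo(smiles):
    """Drop tetrahedral (@, @@) and double-bond (/, \\) stereo markers."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


def qsar_ready_toy(smiles_list):
    """Standardize each structure, then drop duplicates, preserving order."""
    seen = set()
    unique = []
    for smiles in smiles_list:
        standardized = strip_stereo(largest_fragment(smiles))
        if standardized not in seen:
            seen.add(standardized)
            unique.append(standardized)
    return unique


structures = [
    "C[C@H](N)C(=O)O",     # one alanine enantiomer
    "C[C@@H](N)C(=O)O",    # the other enantiomer
    "Cl.C[C@H](N)C(=O)O",  # hydrochloride salt form
    "F/C=C/F",             # E isomer of 1,2-difluoroethene
    "F/C=C\\F",            # Z isomer of 1,2-difluoroethene
]
print(qsar_ready_toy(structures))
```

Five input records collapse to two standardized structures here, because desalting and stereo stripping make previously distinct entries identical. This is the reason deduplication only makes sense after the other standardization steps have run.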

DNA sequences are increasingly used for large-scale biodiversity inventories. Because these genetic data avoid the time-consuming initial sorting of specimens based on their phenotypic attributes, they have been recently incorporated into taxonomic workflows for overlooked and diverse taxa. Major statistical developments have accompanied this new practice, and several models have been proposed to delimit species with single-locus DNA sequences. However, proposed approaches to date make different assumptions regarding taxon lineage history, leading to strong discordance whenever comparisons are made among methods. Distance-based methods, such as Automatic Barcode Gap Discovery (ABGD) and Assemble Species by Automatic Partitioning (ASAP), rely on the detection of a barcode gap (i.e., the lack of overlap in the distributions of intraspecific and interspecific genetic distances) and the associated threshold in genetic distances. Network-based methods, as exemplified by the REfined Single Linkage (RESL) algorithm for the generation of Barcode Index Numbers (BINs), use connectivity statistics to hierarchically cluster related haplotypes into molecular operational taxonomic units (MOTUs), which serve as species proxies. Tree-based methods, including Poisson Tree Processes (PTP) and the General Mixed Yule Coalescent (GMYC), fit statistical models to phylogenetic trees by maximum likelihood or Bayesian frameworks. Multiple webservers and stand-alone versions of these methods are now available, complicating decision-making regarding the most appropriate approach to use for a given taxon of interest. For instance, tree-based methods require an initial phylogenetic reconstruction, and multiple options are now available for this purpose, such as RAxML and BEAST. Across all examined species delimitation methods, judicious parameter setting is paramount, as different model parameterizations can lead to differing conclusions.
The objective of this chapter is to guide users step-by-step through all the procedures involved for each of these methods, while aggregating all necessary information required to conduct these analyses. The Materials section details how to prepare and format input files, including options to align sequences and conduct tree reconstruction with Maximum Likelihood and Bayesian inference. The Methods section presents the procedure and options available to conduct species delimitation analyses, including distance-, network-, and tree-based models. Finally, limits and future developments are discussed in the Notes section. Most importantly, species delimitation methods discussed herein are categorized based on five indicators: reliability, availability, scalability, understandability, and usability, all of which are fundamental properties needed for any approach to gain unanimous adoption within the DNA barcoding community moving forward.
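
The barcode gap defined above can be made concrete with a small sketch. It assumes aligned sequences of equal length, uses the uncorrected p-distance, and takes species labels as already known, purely to illustrate the definition. ABGD and ASAP instead infer partitions from the distance distribution without prior labels, and nothing below reproduces their algorithms. The function names and the example sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def barcode_gap(records):
    """Return (max intraspecific, min interspecific, gap present?).

    records: list of (species_label, aligned_sequence) pairs containing
    at least one within-species and one between-species comparison.
    """
    intra, inter = [], []
    for (sp_a, seq_a), (sp_b, seq_b) in combinations(records, 2):
        target = intra if sp_a == sp_b else inter
        target.append(p_distance(seq_a, seq_b))
    max_intra, min_inter = max(intra), min(inter)
    # A gap exists when the two distance distributions do not overlap.
    return max_intra, min_inter, min_inter > max_intra


records = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "ACGAACCTTC"),
    ("sp_B", "ACGAACCTTT"),
]
print(barcode_gap(records))  # (0.1, 0.3, True)
```

In this made-up dataset any distance threshold strictly between 0.1 and 0.3 separates the two species. When the distributions overlap, no such threshold exists, which is one reason distance-based and tree-based methods can disagree.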

Articles published on Stand-alone Version

The Short Form 6 Dimensions (SF-6D): Development and Evolution.

Impacts of Parameterizing Estuary Mixing on the Large-Scale Circulations in the Community Earth System Model

Reference Architecture for the Integration of Prescriptive Analytics Use Cases in Smart Factories

Technical note: A software tool to extract complexity metrics from radiotherapy treatment plans.

Combined deep-learning optimization predictive models for determining carbon dioxide solubility in ionic liquids

A simple MATLAB toolbox for analyzing calcium imaging data in vitro and in vivo

In silico assessment of biocompatibility and toxicity: molecular docking and dynamics simulation of PMMA-based dental materials for interim prosthetic restorations

Tremors—A Software App for the Analysis of the Completeness Magnitude

Solvent flashcards: a visualisation tool for sustainable chemistry

#703 Usability of a mobile health app for peritoneal dialysis patients: a pilot study

Enhancing cladding mechanical modelling during DBA/LOCA accidents with FRAPTRAN: The TUmech one-dimensional model

GWASTool: A web pipeline for detecting SNP-phenotype associations

Free and open-source QSAR-ready workflow for automated standardization of chemical structures in support of QSAR modeling

Yearly intrasubject variability of hematological biomarkers in elite athletes for the Athlete Biological Passport.

Graph-Based Imputation Methods and Their Applications to Single Donors and Families.

Evaluating Performance Portability with the CMS Heterogeneous Pixel Reconstruction code

Delimiting Species with Single-Locus DNA Sequences.

FRAILTY ASSESSMENT FROM PAPER TO ONLINE: THE E-FI-CGA WEB APP

A deep learning approach to the automatic detection of alignment errors in cryo-electron tomographic reconstructions

LncRTPred: Predicting RNA-RNA mode of interaction mediated by lncRNA.
