Extended similarity indices: the benefits of comparing more than two objects simultaneously. Part 1: Theory and characteristics

Ramón Alain Miranda-Quintana, Dávid Bajusz, Károly Héberger, Anita Rácz

doi:10.1186/s13321-021-00505-3

Open Access

Abstract

Quantification of the similarity of objects is a key concept in many areas of computational science. This includes cheminformatics, where molecular similarity is usually quantified based on binary fingerprints. While there is a wide selection of available molecular representations and similarity metrics, no previous effort has extended the computational framework of similarity calculations to the simultaneous comparison of more than two objects (molecules). The present study bridges this gap by introducing a straightforward computational framework for comparing multiple objects at the same time and providing extended formulas for as many similarity metrics as possible. In the binary case (i.e. when comparing two molecules pairwise) these naturally reduce to their well-known formulas. We provide a detailed analysis of the effects of various parameters on the similarity values calculated by the extended formulas. The extended similarity indices are entirely general and do not depend on the fingerprints used. Two types of variance analysis (ANOVA) help to understand the main features of the indices: (i) ANOVA of mean similarity indices; (ii) ANOVA of sum of ranking differences (SRD). Practical aspects and applications of the extended similarity indices are detailed in the accompanying paper: Miranda-Quintana et al. J Cheminform. 2021. https://doi.org/10.1186/s13321-021-00504-4. Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons.

Highlights

A large number of molecular representations exist, and there are several methods to quantify the similarity of molecular representations. Molecular similarity has been established as the basis of ligand-based virtual screening, as well as molecular informatics [8]
The central element of our work is to provide a general framework for comparing multiple objects at the same time, which naturally extends the range of validity of most of the similarity indices commonly used in cheminformatics and drug design
Individual index variations: to explore how the introduced extended similarity metrics behave for different input data, we have generated random dichotomous fingerprints of various lengths (m = 10, 100, 1000 or 100,000) and calculated the extended similarity values for various numbers of compared objects, according to both the weighted (w) and non-weighted formulas
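The random-fingerprint experiment described above can be sketched as follows. This is a minimal, hypothetical illustration in plain Python, not the authors' implementation (their code is in the GitHub repository named in the abstract). The function names, the `threshold` parameter and the simple column-counting rule are assumptions made for illustration. Each fingerprint position is classified as a 1-coincidence, a 0-coincidence or a dissimilarity from its column sum, and no weighting is applied, so the sketch does not reproduce the paper's weighted or non-weighted formulas. For two objects and a threshold of 0, the counts reduce to the usual pairwise Tanimoto index.

```python
import random


def random_fingerprints(n_objects, length, seed=0):
    """Generate n_objects random dichotomous (0/1) fingerprints of a given length."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(length)] for _ in range(n_objects)]


def column_counters(fingerprints, threshold=0):
    """Classify every fingerprint position by its column sum k over n objects.

    Assumed rule (illustrative only): a position is a 1-coincidence if
    2k - n > threshold, a 0-coincidence if n - 2k > threshold, and a
    dissimilarity otherwise.
    """
    n = len(fingerprints)
    ones = zeros = dissim = 0
    for column in zip(*fingerprints):
        delta = 2 * sum(column) - n
        if delta > threshold:
            ones += 1
        elif -delta > threshold:
            zeros += 1
        else:
            dissim += 1
    return ones, zeros, dissim


def counting_tanimoto(fingerprints, threshold=0):
    """Unweighted Tanimoto-like ratio over n objects (hypothetical simplification)."""
    ones, _zeros, dissim = column_counters(fingerprints, threshold)
    denominator = ones + dissim
    return ones / denominator if denominator else 0.0


def pairwise_tanimoto(x, y):
    """Standard pairwise Tanimoto index a / (a + b + c) for two binary vectors."""
    common = sum(1 for i, j in zip(x, y) if i and j)
    union = sum(1 for i, j in zip(x, y) if i or j)
    return common / union if union else 0.0


# Vary fingerprint length m and number of compared objects n.
for m in (10, 100, 1000):
    for n in (2, 5, 10):
        fps = random_fingerprints(n, m, seed=m + n)
        print(f"m={m:5d}  n={n:2d}  value={counting_tanimoto(fps):.3f}")
```

Raising `threshold` makes the coincidence criterion stricter, so more positions are counted as dissimilar and the value can only decrease or stay the same.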

Summary

Introduction

Molecular similarity has been established as the basis of ligand-based virtual screening, as well as molecular informatics (a collective term encompassing various specific applications of cheminformatics principles, such as compound library design or molecular property predictions) [8]. A large number of molecular representations exist, and there are several methods (similarity and distance measures) to quantify the similarity of molecular representations. Information theory has also provided some similarity metrics. The merits of pairwise fingerprint comparisons have been exhausted on a large scale [15]. Todeschini et al. summarized many of the binary similarity coefficients that have been developed so far [1, 16]. In our earlier works, we investigated the applicability of binary similarity coefficients and proved their equivalency or superiority [17, 18, 19]. We found better similarity coefficients than the most frequently applied Tanimoto index [2] and formulated constraints for finding the best equations for fitting data [20].

Objectives

Methods

Results

Conclusion