Abstract

Determination of molecular similarity plays an important role in analyzing large compound databases in chemical and pharmaceutical research. When molecules are described by binary vectors with bits corresponding to the presence or absence of structural features, the Tanimoto association coefficient is the most commonly used measure of similarity or chemical distance between two compounds. However, when used to select compounds for an optimal spread design, the Tanimoto coefficient produces an intrinsic bias toward smaller compounds. We have developed a new association coefficient that overcomes this bias. This article gives details of the new coefficient and contrasts the two coefficients for selecting diverse sets of compounds from a large collection. When the Tanimoto coefficient is modified as suggested and used to select a diverse set from the National Cancer Institute and Registry of Toxic Effects of Chemical Substances databases, the average number of features among the selected compounds increases by more than 50%.
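
For readers unfamiliar with the measure, the standard Tanimoto coefficient on binary feature vectors can be sketched as follows. This is only an illustration of the well-known coefficient and of how a size bias can arise; it is not the new coefficient proposed in the article, whose definition is not given in this abstract. The fingerprints, the function names, and the convention for two empty fingerprints are assumptions made for the example.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto similarity between two fingerprints.

    Each fingerprint is represented as the set of bit positions that are 'on'
    (features present). Similarity = shared features / features in either.
    """
    union = len(a | b)
    if union == 0:
        # Assumption for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def tanimoto_distance(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Distance form commonly used for diversity selection."""
    return 1.0 - tanimoto(a, b)


# Hypothetical fingerprints, invented for illustration only.
small_1 = {0, 1}                 # small compound, 2 features
small_2 = {2, 3}                 # small compound, 2 different features
large_1 = set(range(0, 40))      # large compound, 40 features
large_2 = set(range(30, 70))     # large compound, 40 features, 10 shared

d_small = tanimoto_distance(small_1, small_2)
d_large = tanimoto_distance(large_1, large_2)

print(f"small pair distance: {d_small:.3f}")
print(f"large pair distance: {d_large:.3f}")
```

In this toy case the two small compounds share no features and therefore sit at the maximum distance, while the two large compounds, which overlap in only a quarter of their features, appear closer. A spread design that maximizes pairwise distance would tend to favor compounds like the small pair, which is consistent with the bias toward smaller compounds described above.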
