Abstract

Computer descriptions of chemical molecular connectivity are necessary for searching chemical databases and for predicting chemical properties from molecular structure. This article reports ongoing work to describe, in SMILES format, the chemical connectivity of the entries in the Crystallography Open Database (COD). Like the COD itself, this collection of SMILES is publicly available on an open-access basis for chemical (substructure) search or for any other purpose. The conventions followed for representing compounds that do not fit valence bond theory are outlined for the most frequently encountered cases. The procedure for obtaining SMILES from the CIF files starts by checking whether the atoms in the asymmetric unit form a chemically acceptable image of the compound. When they do not (molecule on a symmetry element, disorder, polymeric species, etc.), the previously published cif_molecule program is used to obtain such an image in many cases. The Open Babel program package is then applied to generate SMILES strings from the CIF files (either those taken directly from the COD or those produced by cif_molecule, where applicable). The results are then checked and, where necessary, corrected by a human editor in a computer-aided task that still consumes a great deal of human time. Although the procedure still needs to be made more automatic (and hence faster), it has already yielded more than 160,000 curated chemical structures. The purpose of this article is to announce this work to the chemical community and to promote the use of its results.

Highlights

  • Format and conventions: to represent the chemical connectivity of the chemical species contained in the Crystallography Open Database (COD), we have chosen the Simplified Molecular Input Line Entry Specification (SMILES), a very widely used format for storing this kind of information. It has an open specification [19], which is virtually identical to the original specification created by Daylight Chemical Information Systems [20].

  • The SMILES format has its drawbacks: it is based on valence bond theory (VBT), so chemical species that do not fit this theory are troublesome to represent. However, as stated in the introduction, the concept of “chemical connectivity” itself is tightly linked to VBT, and this drawback will probably appear in any existing alternative format that uses the same theory.

  • Creating allcod.fs from allcod.smi is useful as yet another check for syntax errors: a syntactically wrong SMILES in allcod.smi interrupts the creation of allcod.fs, and the point at which the process stops identifies the position of the offending entry.
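
The idea that a failed pass over the file pinpoints the offending entry can be illustrated with a minimal sketch. This is not the Open Babel indexing step itself: it is a simplified, standard-library stand-in that detects only a few obvious syntax problems (unbalanced brackets, parentheses and ring-closure labels). The function names, the assumed line layout (SMILES, whitespace, identifier) and the sample identifiers are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Tuple

DIGITS = "0123456789"


def smiles_syntax_error(smiles: str) -> Optional[str]:
    """Return a message for a few obvious syntax problems, else None.

    Crude stand-in for a real SMILES parser: it only checks bracket atoms,
    branch parentheses and ring-closure labels.
    """
    depth = 0            # nesting depth of branch parentheses
    in_bracket = False   # True while inside a [...] bracket atom
    open_rings = set()   # ring-closure labels opened but not yet closed
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Inside a bracket atom, digits are counts or charges, not rings.
            if ch == "[":
                return "nested '['"
            if ch == "]":
                in_bracket = False
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return "unmatched ']'"
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return "unmatched ')'"
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return "bad two-digit ring closure"
            open_rings ^= {label}
            i += 2
        elif ch in DIGITS:
            open_rings ^= {ch}  # first occurrence opens, second closes
        i += 1
    if in_bracket:
        return "unclosed '['"
    if depth != 0:
        return "unclosed '('"
    if open_rings:
        return "unclosed ring closure"
    return None


def first_bad_entry(lines: Iterable[str]) -> Optional[Tuple[int, str, str]]:
    """Stop at the first faulty line and return (line number, id, message)."""
    for line_no, line in enumerate(lines, start=1):
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        identifier = parts[1].strip() if len(parts) > 1 else ""
        message = smiles_syntax_error(parts[0])
        if message is not None:
            return line_no, identifier, message
    return None


# Made-up entries in the assumed "SMILES<tab>identifier" layout.
sample = ["CCO\t1000001", "c1ccccc1\t1000002", "CC(=O\t1000003"]
print(first_bad_entry(sample))
```

As in the check described above, processing stops at the first faulty entry, so the reported line number points directly at the record that needs manual correction. A real parser would also reject chemically or grammatically invalid strings that this sketch accepts.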

Introduction

The importance of making scientific data and scientific knowledge open to everybody and free of most licensing and copyright barriers is being recognised by a growing number of people and institutions. In this vein, the United Nations Educational, Scientific and Cultural Organisation (UNESCO) is committed to promoting and supporting open access to scientific information [7]. The goal of the COD project is to collect all experimentally determined crystal structures of the so-called “small molecule” compounds, excluding only the macromolecular biological compounds already accessible through the PDB, making the data freely available to anyone and breaking the rather artificial separation between organic, metal–organic, inorganic and metallic
