Abstract

The software for the IUPAC Chemical Identifier, InChI, is extraordinarily reliable. It has been tested on large databases around the world, and has proved itself to be an essential tool in the handling and integration of large chemical databases. InChI version 1.05 was released in January 2017 and version 1.06 in December 2020. In this paper, we report on the current state of the InChI Software, the details of the improvements in the v1.06 release, and the results of a test of the InChI run on PubChem, a database of more than a hundred million molecules. The upgrade introduces significant new features, including support for pseudo-element atoms and an improved description of polymers. We expect that few, if any, applications using the standard InChI will need to change as a result of the changes in version 1.06. Numerical instability was discovered for 0.002% of this database, and a small number of other molecules were discovered for which the algorithm did not run smoothly. On the basis of PubChem data, we can demonstrate that InChI version 1.05 was 99.996% accurate, and InChI version 1.06 represents a step closer to perfection. Finally, we look forward to future releases and extensions for the InChI Chemical identifier.

Highlights

  • The first version of InChI was made publicly available in the spring of 2005 and further versions [1–5], including a separate InChI for Reactions (RInChI [6, 7]), have been released over the years

  • For PubChem CIDs 6555836, 6589644, 6589645, 11871423, 18805145, 18805148, 18805149, 18805151, 18805146, 49950537, 49950540 and 101988808, the stereochemistry is not clear in PubChem. This illustrates the power of the InChI string to highlight molecules within a database that would benefit from further checking. These checks demonstrate that the transition from v1.05 to v1.06 is 99.99% consistent for the standard InChI string

  • InChI (RInChI) [6, 7] and into ongoing work to develop an InChI-based description of mixtures [17]


Summary

Introduction

The first version of InChI was made publicly available in the spring of 2005, and further versions [1–5], including a separate InChI for Reactions (RInChI [6, 7]), have been released over the years.

Polypropylene glycol (PPG, [-O-CH2-CH(CH3)-]n) can be described, in structure-based representation, in several different but equivalent ways (Fig. 2). All six of these representations, which use a * to indicate the Zz atom, are correct; the InChI algorithm selects the highlighted one, in the middle of the bottom row, as the canonical one: InChI=1B/C3H6OZz2/c1-3(2-5)4-6/h3H,2H2,1H3/z101-1-4(6-4,5-2).

All of these possibilities should produce the same InChI string, and they nearly always do. This issue was addressed by using sixteen randomly generated re-numberings for all the molecules in the database.

Most of these cases are the result of a fix to a problem with molecules with acidic hydroxy groups at cationic heteroatom centres, which led to issues with numbering atoms. The details of this are in the CHANGELOG file in the release. These checks demonstrate that the transition from v1.05 to v1.06 is 99.99% consistent for the standard InChI string.
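The re-numbering check described above (sixteen randomly generated re-numberings per molecule, each expected to give the same identifier) can be illustrated with a minimal Python sketch. This is not the InChI test harness: the molecule representation, the function names, and both toy identifier functions are assumptions made for illustration, and the atom list is only the heavy-atom skeleton read off the connectivity layer c1-3(2-5)4-6 of the PPG string quoted above. In practice the `identifier` argument would be a call into the InChI software.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical representation: (element symbols, bonds as pairs of atom indices).
Molecule = Tuple[List[str], List[Tuple[int, int]]]


def renumber(mol: Molecule, rng: random.Random) -> Molecule:
    """Return the same molecular graph with its atoms in a random new order."""
    atoms, bonds = mol
    perm = list(range(len(atoms)))
    rng.shuffle(perm)  # perm[old_index] == new_index
    new_atoms = [""] * len(atoms)
    for old, new in enumerate(perm):
        new_atoms[new] = atoms[old]
    new_bonds = [(perm[a], perm[b]) for a, b in bonds]
    rng.shuffle(new_bonds)  # bond order in the input should not matter either
    return new_atoms, new_bonds


def is_numbering_invariant(
    mol: Molecule,
    identifier: Callable[[Molecule], str],
    trials: int = 16,  # sixteen re-numberings, as in the text
    seed: int = 0,
) -> bool:
    """True if every random re-numbering yields the same identifier string."""
    rng = random.Random(seed)
    reference = identifier(mol)
    return all(identifier(renumber(mol, rng)) == reference for _ in range(trials))


def toy_invariant_id(mol: Molecule) -> str:
    """Stand-in identifier: sorted element+degree labels.
    Independent of atom numbering, but NOT a real canonical identifier."""
    atoms, bonds = mol
    degree = [0] * len(atoms)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return ";".join(sorted(f"{el}{d}" for el, d in zip(atoms, degree)))


def toy_order_dependent_id(mol: Molecule) -> str:
    """Deliberately flawed stand-in: depends on the input atom order."""
    atoms, _ = mol
    return "".join(atoms)


# Heavy-atom skeleton of the PPG repeat unit with two Zz (star) atoms,
# zero-based indices taken from the connectivity layer c1-3(2-5)4-6.
PPG_SKELETON: Molecule = (
    ["C", "C", "C", "O", "Zz", "Zz"],
    [(0, 2), (1, 2), (1, 4), (2, 3), (3, 5)],
)

if __name__ == "__main__":
    print(is_numbering_invariant(PPG_SKELETON, toy_invariant_id))
    print(is_numbering_invariant(PPG_SKELETON, toy_order_dependent_id))
```

A molecule for which the check returns False is a candidate for the kind of follow-up inspection the text describes; the fixed seed only makes the sketch reproducible.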
