Abstract

This work is a large-scale statistical evaluation of how the retention index can be used in GC-MS library search, based on a diverse data set. A search in a large library often fails to return the correct compound even when the library contains it. One way to improve a spectral library search is to use retention index information. The aim of this study is to explore statistical properties that can help in developing automated software that searches a large database for diverse, completely unknown compounds. The data set used in this work as a source of queries contains ~11 thousand spectra of compounds that belong to diverse chemical classes. Six equations for matching reference and experimental “retention index – spectrum” pairs were compared. Good results were obtained with a linear equation for the similarity of pairs, in which the similarity is the sum of the spectral similarity and the product of a negative adjustable weight parameter and the absolute difference between the reference and query retention indices. This equation performs as well as or better than much more complex equations that contain two adjustable parameters instead of one. The widely used threshold-based approach, in which candidates with a high retention index deviation are rejected, performs worse than the other equations. The use of retention indices predicted with neural networks as reference values was also considered. Modern universal retention prediction models, which are applicable to a wide variety of compounds, are still quite inaccurate compared with values from databases, but the predicted values improve a library search as well. When predicted retention indices are used as reference values, the linear equation for matching “retention index – spectrum” pairs also performs as well as or better than the other equations.
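
The linear equation and the threshold-based approach described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the weight of -0.002, the tolerance of 50 index units, the 0 to 1 similarity scale, and the three candidates are hypothetical and are not taken from the study.

```python
def linear_pair_similarity(spectral_similarity, ri_query, ri_reference, weight=-0.002):
    """Spectral similarity plus a negative weight times the absolute RI difference.

    The default weight is an illustrative assumption, not a fitted value.
    """
    return spectral_similarity + weight * abs(ri_query - ri_reference)


def threshold_pair_similarity(spectral_similarity, ri_query, ri_reference, tolerance=50.0):
    """Threshold-based alternative: reject candidates whose RI deviation is too high.

    The tolerance is an illustrative assumption.
    """
    if abs(ri_query - ri_reference) > tolerance:
        return float("-inf")  # rejected candidate goes to the bottom of the list
    return spectral_similarity


def rank_candidates(candidates, ri_query, score_fn):
    """Rank (name, spectral_similarity, ri_reference) tuples, best score first."""
    scored = [(score_fn(sim, ri_query, ri_ref), name) for name, sim, ri_ref in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical candidates: (name, spectral similarity, reference retention index)
candidates = [("A", 0.90, 1300.0), ("B", 0.88, 1210.0), ("C", 0.70, 1205.0)]

ranking_linear = rank_candidates(candidates, 1200.0, linear_pair_similarity)
ranking_threshold = rank_candidates(candidates, 1200.0, threshold_pair_similarity)
```

In this toy case both scoring functions move candidate B to the top, but the linear equation keeps candidate A in the list with a reduced score, whereas the threshold discards it outright.
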
The distribution of differences between query indices and reference indices (both calculated and experimental) was found to be close to an exponential distribution near zero. The dependence of the fraction of correct identifications on the accuracy of the reference retention indices was studied. To create “reference” retention indices with a predefined accuracy, random noise with a double exponential distribution was added to exact values. The use of the molecular mass and molecular formula as additional constraints during a library search was also considered.
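
The noise-addition step can be sketched as follows. This is a minimal sketch, not the procedure used in the study: it assumes that "predefined accuracy" means a target mean absolute error, and the function name, the seed, and the example values are hypothetical. It relies on the fact that the difference of two independent exponential variables with the same rate follows a double exponential (Laplace) distribution whose mean absolute deviation equals the inverse of that rate.

```python
import random


def add_double_exponential_noise(exact_indices, mean_abs_error, seed=0):
    """Return retention indices perturbed by zero-mean double exponential noise.

    Assumption: accuracy is expressed as the mean absolute error of the noise.
    """
    if mean_abs_error <= 0:
        raise ValueError("mean_abs_error must be positive")
    rng = random.Random(seed)
    rate = 1.0 / mean_abs_error
    # Exp(rate) - Exp(rate) is Laplace-distributed with scale 1 / rate.
    return [ri + rng.expovariate(rate) - rng.expovariate(rate) for ri in exact_indices]


# Hypothetical example: 20000 copies of an exact index of 1200, target error 30 units
exact = [1200.0] * 20000
noisy = add_double_exponential_noise(exact, mean_abs_error=30.0)
observed_mae = sum(abs(n - e) for n, e in zip(noisy, exact)) / len(exact)
```

With a sample this large, `observed_mae` should lie close to the requested 30 index units, which gives a quick check that the noise has the intended spread.
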
