A general procedure for finding potentially erroneous entries in the database of retention indices

Mikhail D Khrisanfov, Dmitriy D Matyushin, Andrey S Samokhin

doi:10.1016/j.aca.2024.342375

Abstract

Background: The NIST retention index database is one of the most widely used sources of retention indices. In both untargeted analysis and machine learning studies, filtering for potential errors is often lacking or absent. According to our estimates, about 80% of the compounds in both the NIST 17 and NIST 20 retention index databases have only one RI value per stationary phase, which makes searching for erroneous values with statistical methods impossible. Manual inspection is also impractical because the database contains more than 300 000 entries.

Results: We suggest a two-step procedure, based on machine learning, to find potentially erroneous retention indices. The first step is to use five predictive models to obtain predicted retention index values for the whole database. The second is to compare these predicted values against the experimental ones. We consider a retention index erroneous if its accuracy (the difference between the predicted and experimental values) is in the bottom 5% for each of the five models simultaneously. Using this method, we detected 2093 outlier entries for standard and semi-standard non-polar stationary phases in the NIST 17 retention index database; 566 of those were corrected or removed by the developers in NIST 20.

Significance: This is a novel approach to finding potentially erroneous entries in a large-scale database with mostly unique entries, and it can be applied not only to retention indices. The procedure can help filter and report mishandled data to improve the quality of the dataset for machine learning applications and experimental use.
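The second step of the procedure (comparing predictions with experimental values and keeping only entries that fall in the worst 5% for every model) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the function name `flag_suspect_entries`, the use of the absolute error as the accuracy measure, and the per-model quantile threshold are all assumptions, and the five predictive models themselves are not shown.

```python
import numpy as np


def flag_suspect_entries(experimental, predictions, worst_fraction=0.05):
    """Return a boolean mask of entries that every model predicts poorly.

    experimental: shape (n_entries,), experimental retention indices.
    predictions:  shape (n_models, n_entries), one row per predictive model.
    worst_fraction: share of entries treated as "worst" for each model
                    (0.05 mirrors the 5% figure in the abstract).

    Assumption: accuracy is measured as the absolute difference between
    the predicted and the experimental value.
    """
    experimental = np.asarray(experimental, dtype=float)
    predictions = np.asarray(predictions, dtype=float)

    # Absolute prediction error of every model for every entry.
    abs_err = np.abs(predictions - experimental)

    # Per-model error threshold that separates the worst fraction of entries.
    thresholds = np.quantile(abs_err, 1.0 - worst_fraction, axis=1, keepdims=True)

    # An entry is flagged only if it is among the worst for all models at once.
    return (abs_err > thresholds).all(axis=0)
```

Requiring agreement between all models is what makes the filter conservative: an entry that only one model predicts badly is more likely a weakness of that model than an error in the database, so it is not flagged.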

Journal: Analytica Chimica Acta	Publication Date: Feb 17, 2024
Citations: 1
