The application of machine learning models in chemistry has made remarkable strides in recent years. While analytical chemistry has received considerable interest from machine learning practitioners, the adoption of these models into everyday use remains limited. Among the available analytical methods, infrared (IR) spectroscopy stands out for its affordability, simplicity, and accessibility. However, its use has been limited to the identification of a select few functional groups, as most peaks lie beyond human interpretation. We present a transformer model that enables chemists to leverage the complete information contained within an IR spectrum to directly predict the molecular structure. To cover a large chemical space, we pretrain the model on 634,585 simulated IR spectra and fine-tune it on 3,453 experimental spectra. Our approach achieves a top-1 accuracy of 44.4% and a top-10 accuracy of 69.8% on compounds containing 6 to 13 heavy atoms. When predicting only the scaffold, the model ranks the correct scaffold first in 84.5% of cases and within the top 10 in 93.0% of cases.