Chemical derivatization is a powerful strategy for enhancing the sensitivity and selectivity of liquid chromatography-mass spectrometry in non-targeted analysis of chemicals in complex mixtures. However, it remains impossible to obtain large sets of reference spectra for chemically derived molecules (CDMs), which is a major barrier to real-world applications. Herein, we describe DeepCDM, a deep learning approach that enables accurate prediction of electrospray ionization tandem mass spectra for CDMs. DeepCDM is established by transfer learning from a generic spectrum-prediction model, using a small set of experimentally acquired tandem mass spectra of CDMs; this converts a generic model with low predictive performance for CDMs into a specialized model with high predictive performance. We demonstrate DeepCDM by predicting electrospray ionization tandem mass spectra of dansylated molecules, yielding a dansylation-specialized model (Dns-MS). The success in establishing Dns-MS further enables the development of DnsBank, a dansylation-specialized in silico spectral library. DnsBank significantly increases the accurate annotation rate of dansylated molecules, facilitating the discovery of new hazardous pollutants in an environmental study of leather industry wastewater. DeepCDM is also versatile and extends to other classes of CDMs. We therefore envision that DeepCDM will pave the way for high-throughput identification of CDMs in non-targeted analysis, helping to uncover unknowns with potential health impacts among emerging anthropogenic chemicals.