The spatial distribution of organics in geological samples can be used to determine when and how these organics were incorporated into the host rock. Mass spectrometry (MS) imaging can rapidly collect a large amount of data, but the ions produced are mixed without discrimination, resulting in complex mass spectra that can be difficult to interpret. Here, we apply unsupervised and supervised machine learning (ML) to help interpret spectra from time-of-flight secondary ion mass spectrometry (ToF-SIMS) of an organic-carbon-rich mudstone from the Middle Jurassic of England (UK). It was previously shown that sterane molecular biomarkers in this sample can be detected via ToF-SIMS (Pasterski, M. J. et al., Astrobiology 2023, 23, 936). We use unsupervised ML on scanning electron microscopy-energy dispersive spectroscopy (SEM-EDS) measurements to define compositional categories based on differences in elemental abundances. We then test the ability of four ML algorithms, namely k-nearest neighbors (KNN), recursive partitioning and regression trees (RPART), eXtreme Gradient Boosting (XGBoost), and random forest (RF), to classify the ToF-SIMS spectra using (1) the categories assigned via SEM-EDS, (2) organic and inorganic labels assigned via SEM-EDS, and (3) the presence or absence of detectable steranes in ToF-SIMS spectra. In terms of predictive accuracy and balanced accuracy, KNN was the best-performing model and RPART the worst. Feature importance, that is, the specific features of the ToF-SIMS spectra that a model uses to make classifications, cannot be determined for KNN, which prevents post hoc model interpretation. Nevertheless, the feature importance extracted from the other models was useful for interpreting spectra. We determined that some of the organic ions used to classify biomarker-containing spectra may be fragment ions derived from kerogen, which is abundant in this mudstone sample.
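The abstract reports both accuracy and balanced accuracy without defining them. As a minimal sketch, assuming the common definition of balanced accuracy as the mean of per-class recall (the exact computation used in the study is not stated here), the snippet below shows how the two metrics can diverge when one class is rare. The labels and predictions are hypothetical illustration data, not results from the paper.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of all samples that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 spectra without detectable steranes, 2 with.
y_true = ["absent"] * 8 + ["present"] * 2
# Hypothetical classifier output that misses one of the two "present" spectra.
y_pred = ["absent"] * 8 + ["absent", "present"]

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, balanced accuracy = {bal:.2f}")
```

In this toy case a single missed "present" spectrum barely lowers plain accuracy but halves the recall of the rare class, which is why reporting both metrics is informative for imbalanced labels such as sterane presence or absence.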