STRAPS v1.0: evaluating a methodology for predicting electron impact ionisation mass spectra for the aerosol mass spectrometer

David O Topping,James Allan,Bernard Aumont,M Rami Alfarra

doi:10.5194/gmd-10-2365-2017

Abstract

Our ability to model the chemical and thermodynamic processes that lead to secondary organic aerosol (SOA) formation is thought to be hampered by the complexity of the system. While there are fundamental models now available that can simulate the tens of thousands of reactions thought to take place, validation against experiments is highly challenging. Techniques capable of identifying individual molecules, such as chromatography, are generally only capable of quantifying a subset of the material present, making them unsuitable for a carbon budget analysis. Integrative analytical methods such as the Aerosol Mass Spectrometer (AMS) are capable of quantifying all mass, but because of their inability to isolate individual molecules, comparisons have been limited to simple data products such as total organic mass and the O : C ratio. More detailed comparisons could be made if more of the mass spectral information could be used, but because a discrete inversion of AMS data is not possible, this activity requires a system for predicting mass spectra based on molecular composition. In this proof-of-concept study, the ability to train supervised methods to predict electron impact ionisation (EI) mass spectra for the AMS is evaluated. Supervised Training Regression for the Arbitrary Prediction of Spectra (STRAPS) is not built from first principles. A methodology is constructed whereby the presence of specific mass-to-charge ratio (m/z) channels is fitted as a function of molecular structure before the relative peak height for each channel is similarly fitted using a range of regression methods. The widely used AMS mass spectral database is used as a basis for this, using unit mass resolution spectra of laboratory standards. Key to the fitting process is the choice of structural information, or molecular fingerprint. Our approach relies on using supervised methods to automatically optimise the relationship between spectral characteristics and these molecular fingerprints.
Therefore, any internal mechanisms or instrument features impacting on fragmentation are implicitly accounted for in the fitted model. Whilst one might expect a collection of keys specifically designed according to EI fragmentation principles to offer a robust basis, the suitability of a range of commonly available fingerprints is evaluated. Using available fingerprints in isolation, initial results suggest the generic public MACCS fingerprints provide the most accurate trained model when combined with both decision trees and random forests, with median cosine angles of 0.94–0.97 between modelled and measured spectra. There is some sensitivity to the choice of fingerprint, but the greatest sensitivity is to the choice of regression technique. Support vector machines perform the worst, with median values of 0.78–0.85 and lower ranges approaching 0.4, depending on the fingerprint used. More detailed analysis of modelled versus measured mass spectra demonstrates important composition-dependent sensitivities on a compound-by-compound basis. This is further demonstrated when we apply the trained methods to a model α-pinene SOA system, using output from the GECKO-A model. This shows that use of a generic fingerprint referred to as FP4 and one designed for vapour pressure predictions (Nanoolal) gives plausible mass spectra, whilst the use of the MACCS keys in isolation performs poorly in this application, demonstrating the need for evaluating model performance against other SOA systems rather than existing laboratory databases on single compounds. Given the limited number of compounds used within the AMS training dataset, it is difficult to prescribe which combination of approaches would lead to a robust generic model across all expected compositions.
Nonetheless, the study demonstrates the use of a methodology that would be improved with more training data, fingerprints designed explicitly for fragmentation mechanisms occurring within the AMS, and data from additional mixed systems for further validation. To facilitate further development of the method, including application to other instruments, the model code for re-training is provided via a public GitHub and Zenodo software repository.
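
The two-stage fit and the comparison metric described in the abstract can be illustrated with a minimal Python sketch. It assumes that stage one yields a present/absent flag per m/z channel and stage two a relative peak height per channel, and that the reported "cosine angle" is the normalised dot product between modelled and measured spectra. The function names combine_stages and cosine_similarity, the base-peak normalisation, and all numbers are illustrative assumptions and are not taken from the STRAPS code.

```python
import math
from statistics import median


def combine_stages(present, heights):
    """Combine two hypothetical model outputs into one spectrum.

    present: per-channel booleans from a stage-one presence classifier.
    heights: per-channel relative peak heights from a stage-two regressor.
    Channels flagged absent are zeroed; the result is scaled so that the
    largest peak equals 1 (an assumed normalisation).
    """
    if len(present) != len(heights):
        raise ValueError("present and heights must have the same length")
    masked = [h if p else 0.0 for p, h in zip(present, heights)]
    top = max(masked, default=0.0)
    return [m / top for m in masked] if top > 0 else masked


def cosine_similarity(a, b):
    """Normalised dot product of two spectra on the same m/z grid."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty spectrum
    return dot / (norm_a * norm_b)


# Toy values for three made-up compounds on a four-channel grid.
modelled = [
    combine_stages([True, False, True, True], [0.5, 0.2, 1.0, 0.25]),
    combine_stages([True, True, False, True], [1.0, 0.4, 0.3, 0.1]),
    combine_stages([False, True, True, False], [0.1, 0.8, 0.6, 0.2]),
]
measured = [
    [0.5, 0.0, 1.0, 0.25],
    [1.0, 0.5, 0.1, 0.1],
    [0.2, 1.0, 0.5, 0.0],
]

# One score per compound, summarised by the median as in the abstract.
scores = [cosine_similarity(m, o) for m, o in zip(modelled, measured)]
print("median cosine similarity:", round(median(scores), 3))
```

A score of 1 indicates identical relative peak patterns and a score of 0 indicates no shared channels, so the median across compounds summarises typical agreement while the lower range exposes poorly predicted compounds.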

Highlights

Volatile organic compounds (VOCs), emitted from both natural and anthropogenic sources, are oxidised in the atmosphere to form lower-volatility species that condense onto aerosol particles or contribute to new particle formation (Laaksonen et al., 2008; Sipila et al., 2016; Ehn et al., 2014).
A collection of common fingerprints, and their combinations, are tested in this study and their performance critically evaluated. This is an important sensitivity, since one might expect a collection of keys that relate to electron impact ionisation (EI) fragmentation principles to offer a more robust basis for fitting any method used here.
As we have already noted, comparing the information provided by each fingerprint with a working knowledge of the mechanics of EI fragmentation might help in understanding why a given fingerprint is more suitable.

Summary

Introduction

Volatile organic compounds (VOCs), emitted from both natural and anthropogenic sources, are oxidised in the atmosphere to form lower-volatility species that condense onto aerosol particles or contribute to new particle formation (Laaksonen et al., 2008; Sipila et al., 2016; Ehn et al., 2014). Within most global and regional models, often-used techniques include modelling representative photochemical yields from specific precursors and tuning (Spracklen et al., 2011) or employing a parametric model such as the volatility basis set (Robinson et al., 2007). While both of these approaches can deliver realistic absolute concentrations, because they are not based on explicit physical processes, their predictive skill is always subject to question (Hallquist et al., 2009; Bergström et al., 2012). The development of more applicable explicit models has been facilitated by the ability to automatically predict processes rather than prescribe them (Aumont et al., 2012, 2005), as has been implemented in the Generator of Explicit Chemistry and Kinetics of Organics in the Atmosphere (GECKO-A) and the forthcoming version 4 of the MCM (http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FM013448%2F1). This can be supplemented by the automated prediction of properties important for partitioning, using generalised informatics tools such as UManSysProp (Topping et al., 2016). While it is unlikely that such complex models would be used directly for large-scale Eulerian chemical transport and climate models, and uncertainties with regard to fundamental properties remain (Bilde et al., 2015), they are still highly useful for benchmarking and for providing the parameters for simpler models.

Methods

Results

Conclusion