Abstract

This work evaluates an ensemble data preprocessing (DP) strategy that combines selected variants of normalization, parametric time warping, and baseline correction in varying sequences for modelling gas chromatography–mass spectrometry (GC–MS) data with the classification and regression tree (CART) algorithm. First, the relative merits of single-DP and ensemble-DP strategies were compared using the best-performing sub-retention time (RT) windows reported elsewhere. All the preprocessed sub-datasets were then assessed by their predictive capability, estimated with the CART algorithm. The performance of the CART models was estimated from 50 pairs of training and testing samples prepared by stratified random resampling. Next, the three shortlisted sub-datasets were further evaluated using a larger number of training and testing pairs, and the most discriminative RT points were identified from these three sub-datasets. Finally, the most desirable CART model was constructed using the shortlisted RT points after treatment with the best-performing DP strategy. The results showed that 3-DP strategies tended to outperform the 1-DP and 2-DP strategies. However, the sequence of application must be carefully optimized, because not all 3-DP strategies had a positive impact. Data aligned before baseline correction or normalization were likely to outperform data that were normalized or baseline corrected first. In conclusion, untargeted GC–MS data of neat gasoline should preferably be aligned first, then normalized, and finally baseline corrected.
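The design described above can be illustrated with a minimal sketch. It is not the authors' code. It enumerates the ordered 1-DP, 2-DP, and 3-DP sequences for three preprocessing families and draws 50 stratified random training/testing pairs. The step labels, the toy class labels, the 30% test fraction, and the fixed seed are all assumptions made for illustration; the abstract does not state them, and the study may have considered more than one variant per family.

```python
import random
from collections import defaultdict
from itertools import permutations

# Hypothetical labels for the three preprocessing families named in the abstract.
STEPS = ("alignment", "normalization", "baseline_correction")


def enumerate_strategies(steps=STEPS):
    """Return every ordered sequence of 1, 2, or 3 distinct steps."""
    strategies = []
    for size in range(1, len(steps) + 1):
        strategies.extend(permutations(steps, size))
    return strategies


def stratified_split(labels, test_fraction, rng):
    """Split sample indices so each class keeps roughly the same test share."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # At least one test sample per class (an assumption of this sketch).
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


strategies = enumerate_strategies()
count_by_size = {}
for strategy in strategies:
    count_by_size[len(strategy)] = count_by_size.get(len(strategy), 0) + 1

rng = random.Random(0)  # fixed seed, assumed here only for reproducibility
labels = ["class_A"] * 10 + ["class_B"] * 20  # toy labels, not the study's data
pairs = [stratified_split(labels, 0.3, rng) for _ in range(50)]

print(count_by_size)
print(len(pairs))
```

With one variant per family, this gives 15 ordered strategies: 3 single steps, 6 ordered pairs, and 6 ordered triples. In a fuller version, each strategy would be applied to the sub-dataset, and a CART model would be fitted and scored on each of the 50 pairs.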
