Real-time sensing of minor, or difficult to measure, components in complex systems is a challenge faced across disciplines. For environmental applications, this often entails measurement of target chemicals-that may be harmful and/or of interest at very low concentrations - in a complex medium (tens to thousands of constituent components). Sensor arrays and machine learning (ML) approaches have demonstrated some success but remain limited by the application context and high labor costs of chemical assays for creating large training datasets. This article explores a lower overhead approach, employing data fusion techniques to extract information from a relatively small training dataset. Creation of statistically relevant synthetic training samples is utilized to reduce dependence on costly analysis of samples from the target system. Samples were characterized using eight sensor modalities to create a training set for several ML algorithms, namely artificial neural networks (ANN), support vector regression (SVR), and random forests (RF), to measure NH 4 + concentrations (≤50 μM) online in a game-changing wastewater treatment process. Hyperparameters for each method were tuned using a particle swarm optimization approach, and both the accuracy and consistency of results were evaluated. ANN achieved the lowest mean absolute error (MAE) of ~6 μM, but all methods had a minimum MAE within 20% of this value. When evaluated on computational demand, SVR outperformed other approaches. ANN and RF showed wide variation in resulting MAE for a given parameterization, demonstrating strong dependence on initialization and training process. Overall RF provided the best balance of accuracy and consistency, and therefore, in applications where data are expected to be updated frequently or computational resources are not infinite, RF may provide the best tradeoff in speed, accuracy, and consistency.
Read full abstract