Data-driven approaches (e.g., machine learning) are increasingly used to replace or assist laboratory work in the study of emerging contaminants (ECs). Over the past ten years, a growing number of models and approaches have been applied to ECs, and the datasets used have been continuously enriched. However, large knowledge gaps remain between what these studies have found and its meaning in the natural eco-environment. Most published reviews are organized by type of EC, and the data science issues common to all pollutant types are not sufficiently addressed. To close or narrow these knowledge gaps, we highlight the following issues that have been overlooked in data-driven EC research. Complicated biological and ecological data, together with ensemble models that reveal mechanisms and spatiotemporal trends with strong causal relationships and without data leakage, deserve more attention in the future. In addition, matrix influence, trace concentrations, and complex scenarios have often been ignored in previous work. Therefore, an integrated research framework that covers natural fields, ecological systems, and large-scale environmental problems, rather than relying solely on the analysis of laboratory data, is urgently needed. Beyond the current focus on prediction, data science can inspire the discovery of scientific questions, and mutual inspiration among data science, process and mechanism models, and laboratory and field research is a critical direction. By focusing on these urgent and common issues of data, frameworks, and purposes, regardless of the type of pollutant, data science is expected to achieve great advances in addressing the eco-environmental risks of ECs.