Abstract

Physical process-based hydrological models are widely adopted to simulate water quantity and quality. One of the most commonly used hydrological models is the Soil and Water Assessment Tool (SWAT). SWAT models for a large watershed can contain tens of thousands of Hydrologic Response Units (HRUs), which necessitates considerable computational resources. One way to speed up applications of the SWAT model is to leverage machine learning techniques to identify the features that are crucial for the prediction task, i.e., feature selection. However, the majority of feature selection techniques rely on correlations or some form of score metric (e.g., mutual information). Since correlation does not imply causation, it is important to identify causal features, which can improve prediction accuracy while enhancing the interpretability of machine learning models. The SWAT model, however, uses multiple data inputs and features that typically vary across space/HRUs but may or may not vary over time, which makes it difficult to directly apply causal discovery models to infer causal relations. Moreover, the lack of a ground-truth causal graph for the SWAT model makes it difficult to assess the validity of the learned causal relations. To overcome these problems, we propose a novel framework that first infers causal relations from daily-scale SWAT data using causal discovery algorithms. It then utilizes a community detection module to group similar features together for better interpretability. Finally, it identifies the stable causal relations that appear most often across all timesteps and leverages them for the prediction of water quantity. Utilizing only the causal features to predict the target variable can lead to high accuracy, as it removes the reliance on spurious correlations. We conduct extensive experiments to validate the effectiveness of the proposed framework, along with a real-world case study to evaluate whether the selected features are interpretable.

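To make the pipeline concrete, the following is a minimal Python sketch of the three stages summarized in the abstract: per-timestep causal discovery results are aggregated into stable edges, the surviving variables are grouped via community detection, and the causal neighbours of the target are used as features for prediction. The edge lists, variable names, stability threshold, and the choice of networkx's greedy modularity communities and a random-forest regressor are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

import networkx as nx
import pandas as pd
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.ensemble import RandomForestRegressor


def stable_causal_features(daily_edges, target, stability=0.8):
    """Keep edges present in at least `stability` fraction of timesteps,
    group the surviving variables into communities, and return the
    neighbours of `target` as the selected causal features."""
    counts = Counter(edge for edges in daily_edges for edge in edges)
    n_steps = len(daily_edges)
    stable = [e for e, c in counts.items() if c / n_steps >= stability]

    graph = nx.Graph(stable)
    communities = list(greedy_modularity_communities(graph)) if stable else []

    parents = {u for u, v in stable if v == target} | \
              {v for u, v in stable if u == target}
    return sorted(parents), communities


# `daily_edges` would normally come from running a causal discovery
# algorithm (e.g., PC or PCMCI) on each day's SWAT features; faked here.
daily_edges = [
    {("precip", "streamflow"), ("soil_moisture", "streamflow"),
     ("precip", "soil_moisture")},
    {("precip", "streamflow"), ("temp", "evapotranspiration")},
]
features, groups = stable_causal_features(daily_edges,
                                          target="streamflow",
                                          stability=0.5)

# Fit a predictor on the selected causal features only (toy data).
df = pd.DataFrame({
    "precip": [1.0, 2.0, 0.5],
    "soil_moisture": [0.3, 0.4, 0.2],
    "temp": [20.0, 22.0, 19.0],
    "evapotranspiration": [3.0, 3.5, 2.8],
    "streamflow": [5.0, 7.0, 3.0],
})
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(df[features], df["streamflow"])
print("stable causal features:", features)
```

In this toy run, only the edges touching "streamflow" that persist across enough days survive the stability filter, so the predictor is fitted on "precip" and "soil_moisture" alone; any causal discovery backend that emits per-timestep edge lists could be substituted for the faked input.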