In this paper, we propose a hierarchical attention-based sensor fusion strategy for depth estimation under various weather conditions. Multi-sensor fusion has proven to be a promising solution for predicting accurate depth maps across diverse weather conditions, especially under extreme weather. However, most current studies simply fuse the information from different sensors without jointly considering the differences in their performance at both the sensor level and the feature level. To fill this gap, our hierarchical attention-based fusion strategy uses two attention mask-generation modules that weight the sensor data at the branch level (i.e., per sensor) and at the feature level. By combining these two masks, our system adaptively determines the contribution of each sensor as well as the individual contribution of each feature within a sensor according to their performance under different weather conditions. We compare the proposed method with a baseline, the late-fusion Sparse-to-Dense model, and with two extended models that use only the branch-wise mask and only the feature-wise mask, respectively. The results show that our method performs robustly and outperforms the baseline even in clear environments, where the baseline already performs well. Moreover, we comprehensively investigate the performance of the RGB camera, radar, and LiDAR in foggy environments by visualizing the generated masks. Our results show a significantly increased importance of radar sensors in extreme weather conditions, e.g., dense fog.
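To make the two-level masking concrete, the following is a minimal PyTorch sketch of a branch-wise plus feature-wise attention fusion. The module name, layer sizes, and the specific mask parameterizations (a softmax over sensor branches, a sigmoid per channel from a shared 1x1 convolution) are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of hierarchical (branch-wise + feature-wise) attention fusion.
import torch
import torch.nn as nn

class HierarchicalAttentionFusion(nn.Module):
    """Fuses per-sensor feature maps using a branch-wise and a feature-wise mask."""
    def __init__(self, num_branches: int, channels: int):
        super().__init__()
        # Branch-wise mask: one scalar weight per sensor branch, predicted from pooled features.
        self.branch_gate = nn.Linear(num_branches * channels, num_branches)
        # Feature-wise mask: one weight per channel and location (shared 1x1 conv; an assumption).
        self.feature_gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, branch_feats):
        # branch_feats: list of tensors, each (B, C, H, W), one per sensor branch.
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in branch_feats], dim=1)  # (B, num_branches*C)
        branch_mask = torch.softmax(self.branch_gate(pooled), dim=1)           # (B, num_branches)
        fused = 0
        for i, f in enumerate(branch_feats):
            feat_mask = torch.sigmoid(self.feature_gate(f))                    # (B, C, H, W)
            w = branch_mask[:, i].view(-1, 1, 1, 1)                            # broadcast branch weight
            fused = fused + w * feat_mask * f
        return fused

# Usage: fuse RGB, radar, and LiDAR feature maps of identical shape.
rgb, radar, lidar = (torch.randn(2, 64, 32, 32) for _ in range(3))
fusion = HierarchicalAttentionFusion(num_branches=3, channels=64)
out = fusion([rgb, radar, lidar])  # (2, 64, 32, 32)
```

The softmax over branches enforces a relative weighting between sensors (e.g., shifting weight toward radar in fog), while the per-channel sigmoid allows individual features within a branch to be suppressed or emphasized independently.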