Stream-observing cameras have recently been deployed to monitor water depth dynamics in stream systems. However, most existing image-based water depth monitoring methods require additional gauging equipment, extensive manual annotations, or complex manual calibration. In this paper, we propose a hierarchical model, a novel multi-modal and multi-scale deep learning framework for monitoring water depth in headwater streams using only a night-vision-capable field camera and no additional equipment. In particular, the hierarchical model integrates long-term dynamic patterns extracted from large-scale meteorological data with short-term dynamic patterns extracted from small-scale stream image data to jointly monitor water depth at a fine temporal resolution. To overcome the limited availability of images, we introduce a transfer learning strategy and incorporate more accurate long-term patterns, which together enable the hierarchical model to perform competitively even with a small number of images. We evaluate our method on a real-world headwater stream monitoring dataset from the West Brook study area in western Massachusetts, United States. Our extensive experiments demonstrate that the hierarchical model outperforms several state-of-the-art methods for water depth monitoring, and that more accurate long-term patterns can better guide the monitoring of short-term patterns with greater flexibility and lower computational cost. The hierarchical model achieves a mean absolute error of 4.9 cm at the study site, where the average water depth is 0.89 m, and only 12.5 cm at a more drastically varying site with an average depth of 3.95 m.
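
To make the two-scale design concrete, the following is a minimal sketch, not the authors' implementation, of how a long-term branch over meteorological time series and a short-term branch over stream images might be fused to regress water depth. The class name `HierarchicalDepthModel`, the layer sizes, and the input shapes are illustrative assumptions only.

```python
# Illustrative sketch (not the paper's implementation): a two-branch network
# fusing long-term meteorological dynamics with short-term image features to
# regress water depth. All layer sizes, names, and shapes are assumptions.
import torch
import torch.nn as nn

class HierarchicalDepthModel(nn.Module):
    def __init__(self, met_features=8, hidden=64):
        super().__init__()
        # Long-term branch: recurrent encoder over a meteorological time series.
        self.met_encoder = nn.GRU(met_features, hidden, batch_first=True)
        # Short-term branch: small CNN encoder over a single stream image.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden), nn.ReLU(),
        )
        # Fusion head: combine both scales and regress a scalar water depth.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, met_seq, image):
        _, h = self.met_encoder(met_seq)        # h: (1, B, hidden), last hidden state
        long_term = h.squeeze(0)                # (B, hidden)
        short_term = self.img_encoder(image)    # (B, hidden)
        fused = torch.cat([long_term, short_term], dim=1)
        return self.head(fused).squeeze(1)      # (B,) predicted depths

# Example forward pass on random data: a batch of 4 samples, each with a
# 48-step meteorological window (8 variables) and one 3x128x128 stream image.
model = HierarchicalDepthModel()
depth = model(torch.randn(4, 48, 8), torch.randn(4, 3, 128, 128))
print(depth.shape)  # torch.Size([4])
```

In such a design, the image branch could be pretrained or frozen to mirror a transfer learning strategy when few annotated images are available, while the meteorological branch carries the long-term signal.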