As one of the most significant components of the ecosystem, waterbodies need to be monitored closely at different spatial and temporal scales. Nevertheless, variations in waterbody shape, size, and reflectivity, the complexity and variety of land-cover types, and the diversity of environmental scenes present considerable challenges to accurate waterbody detection (WD). In this paper, we propose a novel network that couples a Transformer with a convolutional neural network (CNN), termed WaterFormer, to automatically, efficiently, and accurately delineate waterbodies from optical high-resolution remotely sensed (HR-RS) images. The network comprises four main components: a dual-stream CNN, a cross-level Vision Transformer, a light-weight attention module, and a sub-pixel up-sampling module. First, the dual-stream network extracts waterbody features from multiple views and at different levels. Then, to exploit the long-range dependencies between low-level spatial information and high-order semantic features, the cross-level Vision Transformer is embedded into the dual-stream network, with the aim of improving WD accuracy. Afterwards, the light-weight attention module is adopted to provide semantically strong feature abstractions by enhancing discriminative neurons, and the sub-pixel up-sampling module is employed to generate high-resolution, high-quality class-specific representations. Quantitative and qualitative evaluations demonstrate that WaterFormer provides a promising means of detecting waterbody areas in satellite images under complex scene conditions. Moreover, comparative analyses with state-of-the-art (SOTA) alternatives, e.g., MSFENet, MSAFNet, and BiSeNet, also verify the generalization ability and superiority of WaterFormer in WD tasks. The assessment results show that WaterFormer achieves an average accuracy of 97.24%, an average precision of 94.59%, an average recall of 91.95%, an average F1-score of 93.24%, and an average Kappa index of 0.9133.
Additionally, we present an open-access HR satellite imagery waterbody dataset, a mesoscale dataset with high-quality, high-precision waterbody annotations, to facilitate future research in this field. The dataset has been released at https://github.com/NJdeuK/WD_Dataset.
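For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, precision, recall, F1-score, and the Kappa index are conventionally computed from a binary water/non-water prediction and its ground-truth mask. It is a minimal illustration under stated assumptions, not the evaluation code used in this work: the function name `binary_wd_metrics` and the toy labels are hypothetical, and the averaging protocol behind the reported figures (e.g., per image or per test area) is not reproduced here.

```python
def binary_wd_metrics(pred, truth):
    """Compute standard binary metrics from flat sequences of 0/1 labels.

    Assumption: 1 marks a water pixel and 0 a non-water pixel.
    """
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # water predicted as water
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)  # background predicted as background
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # background predicted as water
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # water predicted as background
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy example with eight pixels (illustrative values only).
example_pred = [1, 1, 0, 0, 1, 0, 0, 0]
example_truth = [1, 0, 0, 0, 1, 1, 0, 0]
example_metrics = binary_wd_metrics(example_pred, example_truth)
```

In practice, a predicted mask and its annotation would be flattened to one label per pixel before being passed to such a function. Unlike overall accuracy, Kappa discounts the agreement expected by chance, which matters when water pixels are a small fraction of the scene.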