With the success of U-Net or its variants in automatic medical image segmentation, building a fully convolutional network (FCN) based on an encoder-decoder structure has become an effective end-to-end learning approach. However, the intrinsic property of FCNs is that as the encoder deepens, higher-level features are learned, and the receptive field size of the network increases, which results in unsatisfactory performance for detecting low-level small/thin structures such as atrial walls and small arteries. To address this issue, we propose to keep the different encoding layer features at their original sizes to constrain the receptive field from increasing as the network goes deeper. Accordingly, we develop a novel S-shaped multiple cross-aggregation segmentation architecture named S-Net, which has two branches in the encoding stage, i.e., a resampling branch to capture low-level fine-grained details and thin/small structures and a downsampling branch to learn high-level discriminative knowledge. In particular, these two branches learn complementary features by residual cross-aggregation; the fusion of the complementary features from different decoding layers can be effectively accomplished through lateral connections. Meanwhile, we perform supervised prediction at all decoding layers to incorporate coarse-level features with high semantic meaning and fine-level features with high localization capability to detect multi-scale structures, especially for small/thin volumes fully. To validate the effectiveness of our S-Net, we conducted extensive experiments on the segmentation of cardiac wall and intracranial aneurysm (IA) vasculature, and quantitative and qualitative evaluations demonstrated the superior performance of our method for predicting small/thin structures in medical images.