Multimodal medical image segmentation across different imaging devices is a key but challenging task in medical image visual analysis and reasoning. Recently, U-Net based networks have achieved considerable success in semantic segmentation of medical images. However, U-Net uses skip connections to link pairs of symmetric encoder and decoder layers. Although these skip connections preserve the single-granularity information of each encoder layer, they ignore the rich multi-scale spatial information, which substantially limits segmentation performance. In this paper, a multi-scale context-aware network (CA-Net) for multimodal medical image segmentation is proposed, which captures rich context information through dense skip connections and assigns distinct weights to different channels. CA-Net consists of four key components: an encoder module, a multi-scale context fusion (MCF) module, a decoder module, and a dense skip connection module. The proposed MCF module extracts multi-scale spatial information through a spatial context fusion (SCF) block and learns to balance channel-wise features through a Squeeze-and-Excitation (SE) block. Extensive experiments demonstrate that our model achieves state-of-the-art performance on three benchmark datasets of different modalities, covering skin lesion segmentation in dermoscopy images, lung segmentation in CT images, and blood vessel segmentation in retina images.
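The channel reweighting performed by the SE block mentioned above can be sketched as follows. This is a minimal NumPy illustration of the standard Squeeze-and-Excitation operation (squeeze by global average pooling, excitation through a bottleneck of two fully connected layers, then per-channel scaling), not the authors' implementation; the reduction ratio r = 4 and the random weights are assumptions for illustration.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation: reweight the channels of a (C, H, W) feature map.

    Squeeze:    global average pooling -> per-channel descriptor of shape (C,).
    Excitation: two fully connected layers (bottleneck of size C // r)
                with ReLU then sigmoid, yielding channel weights in (0, 1).
    Scale:      multiply each channel of x by its learned weight.
    """
    z = x.mean(axis=(1, 2))                       # squeeze: (C,)
    s = np.maximum(0.0, w1 @ z + b1)              # FC + ReLU: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))      # FC + sigmoid: (C,)
    return x * s[:, None, None]                   # scale each channel

# Usage: an 8-channel 4x4 feature map with an assumed reduction ratio r = 4.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.standard_normal((C, 4, 4))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)  # (8, 4, 4)
```

Because the sigmoid keeps every channel weight in (0, 1), the block can only attenuate channels relative to one another, which is how it learns to balance channel-wise features.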