Satellite image time series (SITS) segmentation is crucial for many applications, like environmental monitoring, land cover mapping, and agricultural crop type classification. However, training models for SITS segmentation remains a challenging task due to the lack of abundant training data, which requires fine-grained annotation. We propose S4, a new self-supervised pretraining approach that significantly reduces the requirement for labeled training data by utilizing two key insights of satellite imagery: (a) Satellites capture images in different parts of the spectrum, such as radio frequencies and visible frequencies. (b) Satellite imagery is geo-registered, allowing for fine-grained spatial alignment. We use these insights to formulate pretraining tasks in S4. To the best of our knowledge, S4 is the first multimodal and temporal approach for SITS segmentation. S4’s novelty stems from leveraging multiple properties required for SITS self-supervision: (1) multiple modalities, (2) temporal information, and (3) pixel-level feature extraction. We also curate m2s2-SITS, a large-scale dataset of unlabeled, spatially aligned, multimodal, and geographic-specific SITS that serves as representative pretraining data for S4. Finally, we evaluate S4 on multiple SITS segmentation datasets and demonstrate its efficacy against competing baselines while using limited labeled data. Through a series of extensive comparisons and ablation studies, we demonstrate S4’s ability as an effective feature extractor for downstream semantic segmentation.