Recent brain mapping efforts are producing large-scale whole-brain images acquired with different imaging modalities. Accurate alignment and delineation of anatomical structures in these images are essential for numerous studies. These requirements are typically modeled as two distinct tasks: registration and segmentation. However, prevailing methods fail to fully explore and exploit the inherent correlation and complementarity between the two tasks. Furthermore, variations in brain anatomy, brightness, and texture pose another formidable challenge in designing multi-modal similarity metrics. A high-throughput approach that overcomes the bottleneck of multi-modal similarity metric design while effectively leveraging the highly correlated and complementary nature of the two tasks is therefore highly desirable. We introduce a deep learning framework for joint registration and segmentation of multi-modal brain images. Under this framework, the registration and segmentation tasks are deeply coupled and collaborate at two hierarchical layers. In the inner layer, we establish strong feature-level coupling between the two tasks by learning a unified common latent feature representation. In the outer layer, we introduce a mutually supervised dual-branch network that decouples the latent features and facilitates task-level collaboration between registration and segmentation. Because the latent features are also modality-independent, the bottleneck of designing multi-modal similarity metrics is essentially resolved. Another merit of this framework is the interpretability of the latent features, which allows intuitive manipulation of feature learning and thereby further improves training efficiency and the performance of both tasks. Extensive experiments on both multi-modal and mono-modal datasets of mouse and human brains demonstrate the superiority of our method. The code is available at https://github.com/tingtingup/DCRS. Supplementary data are available at Bioinformatics online.
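
The following is a minimal PyTorch sketch, not the authors' DCRS implementation, of the two-layer idea described above: a shared encoder produces a common latent feature volume for an image pair (inner, feature-level coupling), and two decoding branches consume the same latent features for registration and segmentation (outer, task-level collaboration). All module names, channel sizes, and the 3D convolutional layout are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; architecture details are assumptions, not the DCRS code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    """Two 3D convolutions with LeakyReLU, reused throughout the sketch."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
    )


class SharedEncoder(nn.Module):
    """Inner layer: maps a concatenated (moving, fixed) image pair to a
    common latent feature volume shared by both tasks."""
    def __init__(self, in_ch=2, feat_ch=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, feat_ch)
        self.enc2 = conv_block(feat_ch, feat_ch * 2)

    def forward(self, moving, fixed):
        x = torch.cat([moving, fixed], dim=1)
        f1 = self.enc1(x)
        return self.enc2(F.avg_pool3d(f1, 2))  # common latent features


class RegistrationBranch(nn.Module):
    """Outer layer, branch 1: decodes the latent features into a dense 3D
    displacement field that warps the moving image toward the fixed image."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.dec = conv_block(feat_ch, feat_ch)
        self.flow = nn.Conv3d(feat_ch, 3, 3, padding=1)

    def forward(self, latent, full_size):
        x = F.interpolate(self.dec(latent), size=full_size,
                          mode="trilinear", align_corners=False)
        return self.flow(x)  # (B, 3, D, H, W) displacement field


class SegmentationBranch(nn.Module):
    """Outer layer, branch 2: decodes the same latent features into
    per-voxel label logits."""
    def __init__(self, feat_ch=64, num_labels=4):
        super().__init__()
        self.dec = conv_block(feat_ch, feat_ch)
        self.head = nn.Conv3d(feat_ch, num_labels, 1)

    def forward(self, latent, full_size):
        x = F.interpolate(self.dec(latent), size=full_size,
                          mode="trilinear", align_corners=False)
        return self.head(x)  # (B, num_labels, D, H, W) logits


if __name__ == "__main__":
    moving = torch.randn(1, 1, 32, 32, 32)
    fixed = torch.randn(1, 1, 32, 32, 32)
    encoder, reg, seg = SharedEncoder(), RegistrationBranch(), SegmentationBranch()
    latent = encoder(moving, fixed)          # shared feature-level coupling
    flow = reg(latent, moving.shape[2:])     # registration output
    labels = seg(latent, moving.shape[2:])   # segmentation output
    print(flow.shape, labels.shape)
```

In such a layout, the mutual supervision described in the abstract would be expressed through the training losses (e.g., segmentation maps constraining the warp and the warp propagating labels), which are omitted here.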