Scene understanding is a key problem in robotics and autonomous driving. These applications require inferring the 3D structure of the scene in order to determine which objects are present and where they are located. Semantic segmentation and disparity estimation networks are typically used for this purpose, but running them separately is inefficient, as each demands substantial computational resources. A possible solution is to learn both tasks jointly with a multi-task approach. Some current methods do so by learning semantic segmentation and monocular depth estimation together; however, depth estimation from a single image is an ill-posed problem. A better alternative is to estimate the disparity between two stereo images and exploit this additional information to improve the segmentation. This work proposes an efficient multi-task method that jointly learns disparity and semantic segmentation. Built on a Siamese backbone for multi-scale feature extraction, the method integrates specialized branches for disparity estimation and for coarse and refined segmentation, leveraging progressive task-specific feature sharing and attention mechanisms to improve accuracy on both tasks simultaneously. The proposal achieves state-of-the-art results for joint segmentation and disparity estimation on three distinct datasets, Cityscapes, TrimBot2020 Garden, and S-ROSeS, while using only a third of the parameters of previous approaches.