Vision-based object detection and scene understanding are becoming key features of environment perception for autonomous driving. In the past few years, numerous large-scale datasets for visual object detection and semantic understanding have been released, which have enormously benefited environment perception in self-driving cars. However, datasets such as KITTI and Cityscapes focus only on the well-organized urban road traffic scenarios of European countries, while ignoring the dense and unstructured traffic conditions of subcontinental countries such as Pakistan, India, Bangladesh, and Sri Lanka. Consequently, environment perception systems developed on these datasets cannot effectively assist self-driving cars in the traffic scenarios of subcontinental countries. To this end, we present CARL-D, a large-scale dataset and benchmark suite for developing 2D object detection and instance-/pixel-level segmentation methods for self-driving cars. CARL-D comprises large-scale stereo-vision driving videos captured in more than 100 cities across Pakistan, covering motorways as well as dense, unstructured traffic scenarios in urban, rural, and hilly areas. For the detection benchmark, 15,000 suitable images are labeled for 2D object detection and recognition. The semantic segmentation benchmark contains 2,500 images with high-quality pixel-level fine annotations and 5,000 coarsely annotated images, which can help deep neural networks leverage weakly labeled data. Alongside the dataset, we present transfer learning-based 2D vehicle detection and scene segmentation methods to evaluate the performance of existing state-of-the-art deep neural networks on our dataset. Lastly, an extensive experimental evaluation and a comparative study demonstrate the advantages of our dataset in terms of inter-class diversity, scene variability, and annotation richness. The proposed benchmark suite is available at https://carl-dataset.github.io/index/.
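To make the transfer-learning evaluation concrete, the sketch below shows one common way such a baseline can be set up: loading a COCO-pretrained Faster R-CNN from torchvision and replacing its box-prediction head so it predicts classes of a new detection dataset. This is a minimal illustration under assumed tooling (PyTorch/torchvision); the class names, class count, and training step here are hypothetical placeholders, not the paper's exact method or the CARL-D label set.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Hypothetical class count: background + 4 illustrative vehicle classes.
# The actual label set is defined by the benchmark annotations.
NUM_CLASSES = 1 + 4

def build_transfer_model(num_classes: int) -> torch.nn.Module:
    """Load a COCO-pretrained Faster R-CNN and swap its box predictor
    so it outputs the target dataset's classes (standard transfer learning)."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

model = build_transfer_model(NUM_CLASSES)
model.train()

# One illustrative fine-tuning step on dummy data shaped like 2D detection targets.
images = [torch.rand(3, 512, 1024)]  # a single RGB image tensor
targets = [{
    "boxes": torch.tensor([[100.0, 120.0, 300.0, 400.0]]),  # [x1, y1, x2, y2]
    "labels": torch.tensor([1]),  # class index in [1, NUM_CLASSES)
}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

loss_dict = model(images, targets)  # train mode returns a dict of detection losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice, the same pattern (pretrained backbone, re-initialized task head, fine-tuning on the new annotations) applies to the segmentation baselines as well, with a segmentation model and pixel-level targets in place of boxes.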