Visual tracking is widely used in applications such as surveillance and autonomous driving, where it provides a wide field of view. However, variations in the video sequence, including occlusion, pose, and illumination changes, remain a major challenge for visual tracking. Hence, this research develops a modified deep learning (DL) approach that performs distance estimation, object detection, and pixel-wise semantic segmentation through a shared convolutional architecture. The proposed object detection model employs a modified Driving Scene Perception (DSP) network, which uses multi-task learning to improve distance estimation, semantic segmentation, and object detection. The main contribution of the research is the Search and Hunt optimization (SaHO) algorithm, which optimally selects the hyperparameter values of the DSP network. The experimental analysis shows that the proposed modified DSP network effectively estimates the distance between the moving object and the camera, and that the DSP with SaHO model is effective for both segmentation and object detection. For object detection on the Cityscapes dataset, the DSP with SaHO model attained an accuracy of 0.98, a specificity of 0.94, and a sensitivity of 0.96. Additionally, the DSP with SaHO model achieved the lowest segmentation errors and outperformed existing techniques, which demonstrates its efficiency.
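For readers unfamiliar with the reported detection metrics, the following is a minimal sketch of how accuracy, specificity, and sensitivity are conventionally derived from confusion-matrix counts. The function name `detection_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the experiments, and the evaluation code used in this research may differ.

```python
import math


def detection_metrics(tp, tn, fp, fn):
    """Return (accuracy, specificity, sensitivity) from confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives, and false negatives. This helper is hypothetical.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    # Accuracy: fraction of all predictions that are correct.
    accuracy = (tp + tn) / total
    # Specificity: fraction of actual negatives correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    # Sensitivity (recall): fraction of actual positives correctly detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    return accuracy, specificity, sensitivity


# Illustrative counts only; not results from the paper.
acc, spec, sens = detection_metrics(tp=90, tn=80, fp=20, fn=10)
print(f"accuracy={acc:.2f} specificity={spec:.2f} sensitivity={sens:.2f}")
```

Reporting specificity and sensitivity alongside accuracy shows separately how well the detector rejects background and how well it finds true objects, which accuracy alone can hide when classes are imbalanced.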