Recognizing Urdu text in natural images is more challenging than recognizing text in languages such as English because Urdu script is cursive. Nevertheless, Urdu scene text has received little attention from either industry or academia, largely because no suitable dataset of Urdu text has been available. To address this problem, we propose a large-scale Urdu Scene Text Dataset (USTD) designed for Urdu scene text detection and recognition. The dataset contains 29674 text annotations (17877 Urdu and 11797 English) and 749725 characters in 6389 images. It covers a wide variety of text images in both the Nastaleeq and Naskh writing styles, taken on various streets and roads in Pakistan. The diversity of the dataset makes it a suitable benchmark for training and evaluating robust neural networks for the detection and recognition of cursive text. We also provide baseline results with several state-of-the-art networks: TextBoxes++, SegLink, DB(ResNet-50), and EAST for text localization, and a Convolutional Recurrent Neural Network (CRNN) for text recognition. We evaluate these models with the widely used metrics of precision, recall, and F-measure. Our experiments show that an end-to-end combination of DB(ResNet-50) and CRNN gives the best results, with precision, recall, and F-measure of 0.7526, 0.5974, and 0.6660, respectively.
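For readers who want to relate the three reported scores, the F-measure is conventionally the harmonic mean of precision and recall. The short Python sketch below assumes that standard definition (the exact evaluation protocol is not stated here), and the function name `f_measure` is an illustrative choice, not part of any released tooling. It recomputes the F-measure from the reported precision and recall of the DB(ResNet-50) and CRNN combination.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1; assumed definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores reported for the end-to-end DB(ResNet-50) + CRNN combination.
reported_precision = 0.7526
reported_recall = 0.5974
reported_f = 0.6660

recomputed_f = f_measure(reported_precision, reported_recall)

# The inputs are already rounded to four decimals, so the recomputed value
# can differ from the reported one in the last digit.
print(f"recomputed F-measure: {recomputed_f:.4f}")
print(f"difference from reported: {abs(recomputed_f - reported_f):.4f}")
```

Under this assumption the recomputed value agrees with the reported F-measure to about three decimal places, which is consistent with rounding of the inputs.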