High-quality rainfall data are essential in many water management problems, including stormwater management and water resources management. Because rainfall varies strongly in space and time, measuring it can be challenging and costly, especially in urban areas. The difficulty is greater in tropical regions, where rainfall events are typically short in duration and high in intensity, and where some undeveloped or developing countries lack a dense rain gauge network and have limited resources for radar and satellite readings. Alternative rainfall estimation methods could therefore help compensate for these shortcomings. Recently, a few studies have examined citizen science methods for collecting rainfall data as a complement to existing rain gauge networks. However, these attempts are at an early stage, and few works have been published on improving the quality of such data. This study therefore focuses on image-based rainfall estimation with potential use in citizen science. A novel convolutional neural network (CNN) model is developed to predict rainfall intensity from images captured by citizens (e.g., with smartphones or security cameras) in an urban area. The model is intended as a complementary sensing tool for the existing rain gauge network in an urban area (e.g., to improve spatial coverage) and is not meant to replace it. This study also presents one of the most extensive rain image datasets published in the literature. Rainfall estimated by the proposed CNN model from images captured by surveillance cameras and by smartphone cameras is compared with rainfall observed at a weather station, yielding strong R² values of 0.955 and 0.840, respectively.
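As an illustration of the comparison metric, the following minimal sketch computes R² between observed and estimated rainfall intensities. It assumes R² denotes the coefficient of determination, 1 - SS_res/SS_tot; the abstract does not state which definition was used, and the squared Pearson correlation is another common choice. The function name `r_squared` and the sample values are hypothetical and are not taken from the study.

```python
def r_squared(observed, estimated):
    """Coefficient of determination between observed and estimated values.

    Assumption: R^2 = 1 - SS_res / SS_tot. The source does not specify
    the definition used in the study.
    """
    if len(observed) != len(estimated):
        raise ValueError("observed and estimated must have the same length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error of the estimates against observations.
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical rainfall intensities in mm/h, for illustration only.
observed = [0.0, 5.0, 10.0, 20.0, 40.0]   # e.g., weather station readings
estimated = [1.0, 4.0, 12.0, 18.0, 41.0]  # e.g., image-based model output
score = r_squared(observed, estimated)
print(f"R^2 = {score:.3f}")
```

Under this definition, a value close to 1 means the estimates explain most of the variance in the observed rainfall, which is the sense in which the reported values can be read as strong agreement.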