High-resolution land surface temperature (LST) data are essential for fine-scale studies of the urban thermal environment. Most urban LST downscaling studies rely on two-dimensional (2-D) data alone and neglect the impact of three-dimensional (3-D) surface structure on LST. The choice of window size is also important for LST downscaling over heterogeneous surfaces. In this study, we downscaled Landsat LST with a random forest (RF) model using localized and stepwise approaches, and included both 2-D and 3-D building morphologies. Our results show that: (1) the performance of local moving-window and stepwise downscaling depends on the degree of surface heterogeneity. For mixed surfaces, a localized window performed better than the global window, and a stepwise approach performed better than a single-step approach. However, for homogeneous surfaces (e.g., urban impervious surfaces), the global window performed better than a localized window; (2) multi-scale geographically weighted regression (MGWR) offers a way to select the optimal moving window. A 7 × 7 window, derived from the minimum predictor bandwidth in MGWR, performed better than the other windows tested (3 × 3, 5 × 5, and 11 × 11) in the Beijing area; (3) building morphology has a non-negligible impact and scaling effect on urban LST. Including building morphologies in the downscaling improved the performance of the RF model. Furthermore, the importance of the sky view factor, building height, and building density was greater at higher resolution than at lower resolution.
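The global-window, moving-window, and stepwise ideas summarized above can be illustrated with a minimal sketch. It is not the study's implementation: it swaps the random forest for a one-predictor ordinary least squares fit so that it runs with the Python standard library alone. The function names (`aggregate`, `fit_line`, `downscale`, `downscale_stepwise`), the single predictor, and the integer scale factors are all assumptions made for illustration.

```python
# Illustrative sketch only: a linear fit stands in for the random forest,
# and all names and parameters here are hypothetical.

def aggregate(grid, scale):
    """Block-average a 2-D list of numbers by an integer factor."""
    rows, cols = len(grid) // scale, len(grid[0]) // scale
    return [
        [
            sum(grid[r * scale + i][c * scale + j]
                for i in range(scale) for j in range(scale)) / scale ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # constant predictor: fall back to the mean LST
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def downscale(coarse_lst, coarse_x, fine_x, scale, window=None):
    """Predict fine-resolution LST from a fine-resolution predictor.

    window=None fits one global model on all coarse cells; an odd integer
    fits a separate model per coarse cell from its window x window
    neighbourhood (clipped at the grid edges).
    """
    rows, cols = len(coarse_lst), len(coarse_lst[0])
    global_fit = None
    if window is None:
        global_fit = fit_line(
            [v for row in coarse_x for v in row],
            [v for row in coarse_lst for v in row],
        )
    fine = [[0.0] * (cols * scale) for _ in range(rows * scale)]
    for r in range(rows):
        for c in range(cols):
            if global_fit is not None:
                a, b = global_fit
            else:
                half = window // 2
                cells = [(i, j)
                         for i in range(max(0, r - half), min(rows, r + half + 1))
                         for j in range(max(0, c - half), min(cols, c + half + 1))]
                a, b = fit_line([coarse_x[i][j] for i, j in cells],
                                [coarse_lst[i][j] for i, j in cells])
            # Apply the coarse-scale relationship to the fine pixels of this cell.
            for i in range(r * scale, (r + 1) * scale):
                for j in range(c * scale, (c + 1) * scale):
                    fine[i][j] = a + b * fine_x[i][j]
    return fine


def downscale_stepwise(coarse_lst, fine_x, scales, window=None):
    """Apply downscale() once per factor in scales, e.g. (2, 2) for a total of 4."""
    remaining = 1
    for s in scales:
        remaining *= s
    lst = coarse_lst
    for s in scales:
        x_in = aggregate(fine_x, remaining)   # predictor at the current resolution
        remaining //= s
        x_out = aggregate(fine_x, remaining)  # predictor at the next, finer resolution
        lst = downscale(lst, x_in, x_out, s, window)
    return lst
```

In this sketch, passing `window=None` fits one global model, while an odd integer such as 7 fits a separate model for each coarse cell from its surrounding neighbourhood. Passing `scales=(2, 2)` downscales in two steps, whereas `scales=(4,)` reaches the same target resolution in a single step.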