For realistic simulation, terrain models must combine various types of material and texture information when terrain is reconstructed for three-dimensional numerical simulation. However, constructing such models with conventional methods is costly in both manpower and time. This study therefore used a convolutional neural network (CNN) architecture to classify materials in multispectral remote sensing images and thereby simplify the construction of future models. Visible light (i.e., RGB), near-infrared (NIR), normalized difference vegetation index (NDVI, computed as (NIR - red) / (NIR + red)), and digital surface model (DSM) images were examined.

This paper proposes the robust U-Net (RUNet) model, which integrates multiple CNN architectures, for material classification. The model is based on an improved U-Net architecture combined with the shortcut connections of the ResNet model, which preserve the features extracted by the shallow layers of the network. The architecture is divided into an encoding layer and a decoding layer: the encoder comprises 10 convolutional layers and 4 pooling layers, and the decoder contains 4 upsampling layers, 8 convolutional layers, and 1 classification convolutional layer. Material classification in this study involved training and testing the RUNet model. Because remote sensing images are large, the training process randomly crops fixed-size subimages from the training set and inputs them into the RUNet model. To account for the spatial information of the materials, the testing process cuts multiple subimages from each test image through mirror padding and overlapping cropping, classifies each subimage with RUNet, and finally merges the subimage classification results back into the original test image.

Experiments used the aerial image labeling dataset of the National Institute for Research in Digital Science and Technology (Inria, abbreviated from the French Institut national de recherche en sciences et technologies du numérique), a dataset configured from it (Inria-2), and a dataset from the International Society for Photogrammetry and Remote Sensing (ISPRS). Material classification was performed with RUNet, and the effects of mirror padding and overlapping cropping were analyzed, as was the impact of subimage size on classification performance. The Inria dataset yielded the best results: after morphological optimization of the RUNet output, the overall intersection over union (IoU) and classification accuracy reached 70.82% and 95.66%, respectively. On the Inria-2 dataset, the IoU and accuracy were 75.5% and 95.71%, respectively, after classification refinement. Although this overall IoU and accuracy were 0.46% and 0.04% lower than those of the improved fully convolutional network, the training time of the RUNet model was approximately 10.6 h shorter. In the ISPRS experiment, the overall accuracy on the combined multispectral, NDVI, and DSM images reached 89.71%, surpassing that on RGB images alone: NIR and DSM provide additional information on material features, reducing the misclassifications caused by similar colors, shapes, or textures in RGB images. Overall, RUNet outperformed the other models in the material classification of remote sensing images, and the present findings indicate its potential for land use monitoring and disaster assessment as well as for model construction in simulation systems.
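The abstract fixes the layer counts of RUNet but not the block design. The following PyTorch sketch is one plausible reading of the described architecture, assuming residual double-convolution blocks for the ResNet-style shortcuts and standard U-Net skip concatenations; the class names, channel widths, and the use of transposed convolutions for upsampling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResDoubleConv(nn.Module):
    """Two 3x3 convolutions with a ResNet-style shortcut, so the features
    extracted by the shallow part of the block are preserved (the stated
    motivation for adding shortcuts to U-Net)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the shortcut matches the output channel count.
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x) + self.shortcut(x))

class RUNetSketch(nn.Module):
    def __init__(self, in_ch=3, n_classes=2, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 16]
        # Encoder: 5 residual blocks (10 convolutions) with 4 pooling layers.
        self.enc = nn.ModuleList(
            [ResDoubleConv(in_ch, chs[0])]
            + [ResDoubleConv(chs[i], chs[i + 1]) for i in range(4)]
        )
        self.pool = nn.MaxPool2d(2)
        # Decoder: 4 upsampling layers, 4 residual blocks (8 convolutions),
        # and one 1x1 classification convolution.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2) for i in range(4)]
        )
        self.dec = nn.ModuleList(
            [ResDoubleConv(chs[i] * 2, chs[i]) for i in range(4)]
        )
        self.classify = nn.Conv2d(chs[0], n_classes, 1)

    def forward(self, x):
        skips = []
        for block in self.enc[:-1]:
            x = block(x)
            skips.append(x)          # skip connection to the decoder
            x = self.pool(x)
        x = self.enc[-1](x)          # bottleneck
        for i in reversed(range(4)):
            x = self.up[i](x)
            x = torch.cat([skips[i], x], dim=1)  # U-Net concatenation
            x = self.dec[i](x)
        return self.classify(x)
```

Counting only the 3x3 convolutions, this sketch matches the abstract's 10 encoder and 8 decoder convolutional layers plus 1 classification convolution; the 1x1 shortcut projections are extra bookkeeping for the residual connections.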
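For training, the abstract states only that fixed-size subimages are randomly cut from the large training images. A minimal NumPy sketch of that step follows; the function name and signature are hypothetical.

```python
import numpy as np

def random_subimage(image, label, size=512):
    # Crop an aligned (image, label) pair of size x size pixels at a
    # random position inside a large remote sensing image.
    h, w = image.shape[:2]
    y = np.random.randint(0, h - size + 1)
    x = np.random.randint(0, w - size + 1)
    return image[y:y + size, x:x + size], label[y:y + size, x:x + size]
```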
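At test time, the merging rule for overlapping crops is not specified in the abstract; the sketch below assumes that class scores are averaged where tiles overlap and that mirror padding supplies context at the image borders. The helper `classify_large_image` and its parameters are hypothetical.

```python
import numpy as np

def _pad_amount(size, tile, stride):
    # Extra pixels needed so tiles placed every `stride` pixels cover `size`.
    return tile - size if size <= tile else -(size - tile) % stride

def classify_large_image(model_fn, image, n_classes, tile=512, overlap=64):
    # Steps: (1) mirror-pad the borders, (2) cut overlapping tiles,
    # (3) classify each tile with model_fn, (4) average the class scores
    # where tiles overlap and take the per-pixel argmax.
    # image:    (H, W, C) array.
    # model_fn: maps a (tile, tile, C) array to (tile, tile, n_classes) scores.
    h, w, _ = image.shape
    stride = tile - overlap
    # Mirror padding gives border pixels reflected context instead of zeros.
    padded = np.pad(
        image,
        ((0, _pad_amount(h, tile, stride)), (0, _pad_amount(w, tile, stride)), (0, 0)),
        mode="reflect",
    )
    ph, pw = padded.shape[:2]
    scores = np.zeros((ph, pw, n_classes), dtype=np.float32)
    counts = np.zeros((ph, pw, 1), dtype=np.float32)
    for y in range(0, ph - tile + 1, stride):
        for x in range(0, pw - tile + 1, stride):
            scores[y:y + tile, x:x + tile] += model_fn(padded[y:y + tile, x:x + tile])
            counts[y:y + tile, x:x + tile] += 1.0
    # Average overlapping predictions, then crop the padding away.
    return np.argmax(scores / counts, axis=-1)[:h, :w]
```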