To tackle the challenges of cold start and data sparsity in recommendation systems, a growing number of researchers are integrating item features, which has led to the emergence of multimodal recommendation systems. Although approaches based on graph convolutional networks have achieved significant success, they still face two limitations: (1) Users have different preferences for different types of features, but existing methods often treat these preferences equally or do not address the issue explicitly. (2) They do not effectively distinguish the similarity between item features of different modalities, overlook the unique characteristics of each modality, and fail to fully exploit their complementarity. To address these issues, we propose the user perception-guided graph convolutional network for multimodal recommendation (UPGCN). The model consists of two main parts: the user perception-guided representation enhancement module (UPEM) and the multimodal two-step enhanced fusion method. Together, they capture user preferences for different modalities to enhance user representations. At the same time, by distinguishing the similarity between different modalities, the model filters out noise and fully leverages their complementarity to obtain more accurate item representations. Comprehensive experiments show that the proposed model outperforms the baseline models in recommendation performance, demonstrating its effectiveness.