Object recognition, one of the main goals of robot vision, is a vital prerequisite for service robots to perform domestic tasks. Thanks to the rich sensory information provided by RGB-D sensors, RGB-D-based object recognition has received increasing attention. However, existing works focus on jointly exploiting RGB and depth data for object recognition while ignoring the influence of depth image quality on recognition performance. Moreover, in real-world scenarios, many objects appear highly similar from certain viewing angles, which makes accurate recognition challenging for service robots. In this paper, we propose CNN-TransNet, a novel end-to-end Transformer-based architecture with convolutional neural networks (CNNs) for RGB-D object recognition. To cope with high inter-class similarity, discriminative multi-modal feature representations are generated by learning and relating multi-modal features at multiple levels. In addition, we employ a multi-modal fusion and projection (MMFP) module that reweights the contribution of each modality to address the problem of poor-quality depth images. Our proposed approach achieves state-of-the-art performance on three datasets (the Washington RGB-D Object Dataset, JHUIT-50, and the Object Clutter Indoor Dataset), with accuracies of 95.4%, 98.1%, and 94.7%, respectively. The results demonstrate the effectiveness and superiority of the proposed model on the RGB-D object recognition task.
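The modality-reweighting idea behind the MMFP module can be illustrated with a minimal sketch. This is not the paper's actual implementation; it assumes a hypothetical gated fusion in which a learned gate scores each modality's feature vector, a softmax turns the scores into weights, and the fused representation is the weighted sum, so a low-quality depth feature can be down-weighted:

```python
import numpy as np

def gated_fusion(rgb_feat, depth_feat, w_gate, b_gate):
    """Hypothetical modality-reweighting fusion (illustrative only).

    rgb_feat, depth_feat: feature vectors of equal dimension d
    w_gate: (2, 2*d) gate matrix producing one score per modality
    b_gate: (2,) gate bias
    Returns the fused feature and the per-modality weights.
    """
    concat = np.concatenate([rgb_feat, depth_feat])
    scores = w_gate @ concat + b_gate          # one score per modality
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    weights = exp / exp.sum()                  # weights sum to 1
    fused = weights[0] * rgb_feat + weights[1] * depth_feat
    return fused, weights

# Toy usage with random features and gate parameters
rng = np.random.default_rng(0)
d = 8
rgb, depth = rng.normal(size=d), rng.normal(size=d)
w, b = rng.normal(size=(2, 2 * d)), np.zeros(2)
fused, weights = gated_fusion(rgb, depth, w, b)
```

In a trained model the gate parameters would be learned end-to-end, so the network itself discovers when the depth stream is unreliable and shifts weight to the RGB stream.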