Retinal diseases are among today's major public health issues and call for advanced computer-aided diagnosis. We propose HTC-Retina, a hybrid model for multi-label classification that automatically classifies seven retinal diseases from Optical Coherence Tomography (OCT) images. We show that combining the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) yields a more powerful model for medical image classification, especially for tasks that depend on local lesion information, as retinal disease diagnosis does. CNNs use parameters efficiently and extract local features and multi-scale feature maps through convolutional operations. ViTs, on the other hand, use self-attention to capture long-range and global dependencies within an image. We show that hybridizing these complementary capabilities of CNNs and ViTs yields image processing that is both more robust and more efficient. Instead of tokenizing the raw input OCT image directly for the transformer, the proposed model adopts a hierarchical CNN module called Convolutional Patch and Token Embedding (CPTE). The CPTE module introduces an inductive bias, reduces the reliance on large-scale datasets, and addresses the ViT's difficulty in extracting low-level features. In addition, given the importance of local lesion information in OCT images, the model relies on a parallel module called Residual Depthwise-Pointwise ConvNet (RDP-ConvNet) to extract high-level features. RDP-ConvNet uses depthwise and pointwise convolution layers within a residual network architecture.
The HTC-Retina model was evaluated on three datasets: OCT-2017, OCT-C8, and OCT-2014. It outperformed previously established models, achieving accuracy rates of 99.40%, 97.00%, and 99.77%, and sensitivity rates of 99.41%, 97.00%, and 99.77%, respectively. Notably, the model achieved this performance while maintaining computational efficiency.
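To illustrate why depthwise and pointwise convolutions, as used in RDP-ConvNet, are parameter-efficient, the following minimal sketch compares the weight count of a standard convolution with that of a depthwise convolution followed by a pointwise (1 x 1) convolution. The channel counts and kernel size are hypothetical values chosen for illustration and are not taken from the HTC-Retina configuration; bias terms are ignored, and the function names are our own.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_pointwise_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes for illustration only (not from the paper).
C_IN, C_OUT, K = 64, 128, 3

standard = standard_conv_params(C_IN, C_OUT, K)
separable = depthwise_pointwise_params(C_IN, C_OUT, K)
ratio = separable / standard

print(f"standard conv weights:            {standard}")
print(f"depthwise + pointwise weights:    {separable}")
print(f"fraction of standard conv weights: {ratio:.3f}")
```

Under these assumed sizes, the depthwise-pointwise pair needs roughly 12% of the weights of the standard convolution. The ratio equals 1/c_out + 1/k², so the saving grows with the number of output channels and with the kernel size.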