Optical coherence tomography (OCT) has become the leading imaging technique for the diagnosis and treatment planning of retinal diseases. Retinal OCT image segmentation extracts lesions and/or tissue structures to support ophthalmologists' decisions, and multi-class segmentation is commonly needed. Because target regions often spread widely within the retina, and different categories can have similar intensities and locations, a good segmentation network must combine global modeling capability with the ability to capture fine details. To address the challenge of capturing global and local features simultaneously, we propose HyFormer, an efficient, lightweight, and robust hybrid network architecture. The architecture uses parallel Transformer and convolutional encoders that extract features independently. A multi-scale gated attention block and a group positional embedding block are introduced within the Transformer encoder to enhance feature extraction. The features are integrated in a decoder composed of the proposed three-path fusion modules. A class activation map-based cross-entropy loss function is also proposed to improve segmentation results. Evaluations are performed on a private dataset of myopic traction maculopathy lesions and on the public AROI dataset for retinal layer and lesion segmentation in age-related macular degeneration. The results demonstrate HyFormer's superior segmentation performance and robustness compared with existing methods, showing promise for accurate and efficient OCT image segmentation.