Echocardiography is a key tool for diagnosing cardiac diseases, and accurate left ventricular (LV) segmentation in echocardiographic videos is crucial for assessing cardiac function. However, semantic segmentation of video must account for the temporal correlation between frames, which makes the task very challenging. This article introduces an innovative method that incorporates a modified mixed attention mechanism into the SegFormer architecture, enabling it to effectively capture the temporal correlation present in video data. For each time step, the proposed method encodes the input image with the encoder to obtain the current feature map. This map, together with the historical feature map, is then fed into a time-sensitive mixed attention mechanism, a temporal variant of the convolutional block attention module (TCBAM). The TCBAM output serves both as the historical feature map for the subsequent time step and, combined with the current feature map, as the representation for the current step. The processed feature map is then passed to the multilayer perceptron (MLP) and subsequent networks to generate the final segmented image. Extensive experiments were conducted on three datasets: Hamad Medical Corporation, Tampere University, and Qatar University (HMC-QU); Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS); and Sunnybrook Cardiac Data (SCD). The proposed method achieves a Dice coefficient of 97.92% on the SCD dataset and an F1 score of 0.9263 on the CAMUS dataset, outperforming all other models. This research provides a promising solution to the temporal modeling challenge in video semantic segmentation with transformer-based models and points to a promising direction for future research in this field.
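To make the per-frame data flow concrete, the following is a minimal PyTorch sketch of how such a temporal CBAM-style fusion could be wired; the module name, the additive fusion of current and historical maps, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a time-sensitive CBAM-style fusion block (assumed design,
# not the paper's code): the current encoder feature map and a historical
# feature map are fused, then re-weighted by channel and spatial attention.
import torch
import torch.nn as nn

class TCBAM(nn.Module):
    """Hypothetical temporal CBAM: fuses current and historical feature maps."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP for CBAM-style channel attention.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 conv over [avg, max] channel-pooled maps for spatial attention.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, curr: torch.Tensor, hist: torch.Tensor) -> torch.Tensor:
        x = curr + hist  # assumed additive fusion of current and historical maps
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention over the channel-refined features.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial(s))
        return x  # reused as the historical feature map for the next frame

# Per-frame usage: encoder features pass through TCBAM, whose output is
# carried forward as history and combined with current features for the
# MLP decoder head (shapes are placeholders).
tcbam = TCBAM(channels=256)
hist = torch.zeros(1, 256, 28, 28)  # initial historical feature map
for frame_feats in (torch.randn(1, 256, 28, 28) for _ in range(4)):
    hist = tcbam(frame_feats, hist)       # output becomes the new history
    decoder_input = frame_feats + hist    # assumed combination fed to the MLP
```

In this sketch the recurrence is carried entirely by the `hist` tensor, so the model sees temporal context at essentially the cost of one extra attention block per frame; how the paper actually combines the two maps (addition, concatenation, or otherwise) is not specified in the abstract.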