Gait analysis is an essential technique in treating patients with lower limb dysfunctions. Traditional methods often rely on expensive and complex equipment, such as wearable body sensors and multi-camera systems with marker tracking. Aiming for a more cost-effective yet accurate alternative, this paper introduces GaitFormer, a novel approach that leverages the Vision Transformer (ViT) for gait analysis using minimal, non-invasive equipment, i.e., a single low-cost RGB camera. First, a unique dataset comprising six walking patterns gathered from 80 volunteers is developed using a multi-camera system with marker tracking. The ViT-based GaitFormer is then proposed to automatically recognize human walking patterns from a single RGB camera. GaitFormer is a hybrid pipeline of four steps: (i) a cascaded convolutional network that estimates 2D human key points; (ii) a ViT-based dual-stream spatial–temporal network that lifts the 2D key points into 3D; (iii) extraction of angle features from specific lower limb key joints for clinical gait analysis, capturing the geometric, kinematic, and physical attributes of human motion; and (iv) a pure self-attention-based classification network that recognizes clinical human walking patterns. The experiments comprehensively validate each step against various related baseline methods and the multi-camera tracking system, with results demonstrating the promising performance of GaitFormer as an affordable, precise, and integrated solution. To the best of our knowledge, GaitFormer is the first hybrid CNN- and ViT-based end-to-end solution for clinically valuable gait analysis using a low-cost device.
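
To illustrate the kind of joint-angle feature mentioned in step (iii), the angle at a joint can be computed from three 3D key points as the angle between the two limb segments meeting at that joint. The following is a minimal sketch under our own assumptions, not the GaitFormer implementation: the function name `joint_angle`, the hip/knee/ankle choice, and the example coordinates are hypothetical and serve only to show the geometry.

```python
import math

def joint_angle(a, b, c):
    """Return the angle in degrees at point b formed by segments b->a and b->c.

    a, b, c are 3D key points given as (x, y, z) tuples, e.g. hip, knee, ankle.
    (Hypothetical helper for illustration only.)
    """
    # Vectors from the joint to its two neighbouring key points
    ba = [ai - bi for ai, bi in zip(a, b)]
    bc = [ci - bi for ci, bi in zip(c, b)]

    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0.0:
        raise ValueError("coincident key points: angle is undefined")

    # Clamp to [-1, 1] to guard against floating-point round-off
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Assumed example coordinates (metres): a fully extended leg and a right-angle bend
straight = joint_angle((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0))
bent = joint_angle((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0))
```

Applied frame by frame to the estimated 3D key points, such a function would yield an angle time series per joint, which is one plausible form for the input to a sequence classifier.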