To address the challenge of precise foreign object detection on railway tracks, this study proposes RailSegVITNet, an efficient deep learning model for accurately segmenting the track region, which supplies the location information needed to identify intrusions. RailSegVITNet combines lightweight bottleneck blocks, separable self-attention, and feature aggregation to balance real-time performance and accuracy. It follows an encoder–decoder framework: the encoder extracts features with bottleneck blocks and applies separable self-attention at several stages to strengthen the representation, while the decoder uses feature aggregation to merge multi-resolution features and improve segmentation performance. On the proposed Railway-seg dataset, RailSegVITNet achieves an MIoU (Mean Intersection over Union) of 91.43% while requiring only 2.01 G FLOPs (Floating Point Operations) and 1.4 M parameters. Furthermore, on the publicly available RailSem19 dataset, RailSegVITNet retains its lightweight architecture while achieving segmentation performance comparable to or higher than that of popular models such as TopFormer, SegFormer, Segmenter, and DeepLab. These results indicate its potential practical value for efficient identification of foreign object intrusion.
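For readers unfamiliar with the reported metric, the following is a minimal sketch of how MIoU is commonly computed from predicted and ground-truth label maps. It is a generic definition, not the evaluation code used in this study: the function name `mean_iou`, the flattened-list input format, and the choice to skip classes that appear in neither prediction nor ground truth are all assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flat lists of integer class labels.

    Assumption: classes absent from both pred and target are skipped
    rather than counted as IoU = 0 or 1.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counted once in both intersection and union of its class.
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six pixels, two classes (0 = background, 1 = track).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)  # each class has IoU 2/4, so 0.5
```

In practice the counts are usually accumulated over the whole test set before dividing, so that small images do not dominate the average.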