In the last few years, recurrent neural networks, traditionally used for natural language processing tasks, have largely been replaced by transformers. Transformers also receive text input as a sequence, but thanks to the attention mechanism they provide much better results than LSTM-, GRU-based, or similar networks. Self-attention mitigates the problem of fading memory by allowing efficient evaluation of dependencies between distant tokens, and it parallelizes well on modern processing units such as GPUs. Until recently, the use of transformers for computer vision (CV) tasks was minimal. The biggest obstacles that hindered progress in this field were the immense computational complexity, the fact that an image is a grid rather than a sequence like text, and the lack of a strong inductive bias, in other words, the grasp of local correlations that their CNN counterparts possess. The latter slowed the adoption of vision transformers (ViTs) in semantic segmentation (SS) even more. However, it was recently shown that, given sufficient data, transformers can outperform CNN-based networks in image classification and, with a suitable ViT structure, even in SS. A promising way to provide a ViT with the required training data is semi-supervised learning (SSL), which extracts useful information from unlabeled data while requiring only a small amount of labeled data. This approach is especially beneficial for SS, since manually creating segmentation masks is very time-consuming. This paper proposes a robust semi-supervised ViT training method that uses minimal labeled data. The combination of a strong augmentation pipeline and a dual-teacher paradigm achieves good performance for SS of road traffic scenes in unstructured environments without the need for an extensive hyperparameter search.
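To make the dual-teacher idea concrete, the sketch below shows one common way such a scheme can be realized in PyTorch: two teachers alternately produce confidence-filtered pseudo-labels from weakly augmented unlabeled images, while the student is trained on strongly augmented views plus the labeled batch. This is a minimal illustration under stated assumptions, not the paper's exact recipe; all names (`student`, `teacher_a`, `teacher_b`, `weak_aug`, `strong_aug`), the alternating schedule, the confidence threshold, and the EMA updates are hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F

# Assumptions: `student`, `teacher_a`, `teacher_b` are segmentation ViTs with
# identical output shapes (N, C, H, W); `weak_aug` / `strong_aug` are callables,
# with `strong_aug` assumed photometric-only so pseudo-labels stay spatially
# aligned with the strongly augmented view.

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Exponential moving average update of a teacher's weights."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def semi_supervised_step(student, teacher_a, teacher_b, optimizer,
                         labeled, unlabeled, weak_aug, strong_aug,
                         step, conf_thresh=0.95, unsup_weight=1.0):
    x_l, y_l = labeled            # labeled images and ground-truth masks
    x_u = unlabeled               # unlabeled image batch

    # One way to realize a "dual teacher" scheme: alternate which teacher
    # supplies pseudo-labels, so the student never chases a single target.
    teacher = teacher_a if step % 2 == 0 else teacher_b

    with torch.no_grad():
        probs = torch.softmax(teacher(weak_aug(x_u)), dim=1)
        conf, pseudo = probs.max(dim=1)          # per-pixel confidence, label
        mask = (conf >= conf_thresh).float()     # keep confident pixels only

    # Supervised loss on labeled data, consistency loss on strong views.
    sup_loss = F.cross_entropy(student(x_l), y_l)
    unsup_logits = student(strong_aug(x_u))
    unsup_loss = (F.cross_entropy(unsup_logits, pseudo, reduction="none")
                  * mask).mean()

    loss = sup_loss + unsup_weight * unsup_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Only the active teacher tracks the student this step, so the two
    # teachers drift apart and provide diverse pseudo-labels.
    ema_update(teacher, student)
    return loss.item()
```

Keeping the two teachers on staggered EMA updates is one plausible source of the diversity the dual-teacher paradigm relies on; the confidence threshold then discards unreliable pixels so that the strong-augmentation consistency loss does not amplify pseudo-label noise.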