Fire is one of the leading causes of fatalities, property damage, and economic and ecological disruption. To detect fires early from vision-sensor data and thereby prevent or reduce the resulting damage, deep learning models have been widely adopted to overcome the limitations of conventional methods. However, mainstream convolutional neural network (CNN) models generalise poorly to unseen scenarios and struggle to achieve a good trade-off among accuracy, inference speed, and model size. Vision transformers (ViTs) currently outperform conventional CNN models; however, they are computationally expensive and require more data for training, so they perform poorly on small and medium-sized datasets, which are common in the fire scene classification domain. In this work, we employ a novel ViT architecture that combines shifted patch tokenisation and local self-attention modules for efficient fire scene classification, enabling the model to be trained from scratch even on small and medium-sized datasets. Furthermore, to make the model suitable for real-time inference, we modify the transformer encoder, reducing both the number of floating-point operations and the model size. We also develop a medium-scale fire dataset containing complex real-world scenarios. Our model is assessed on three benchmark datasets and the self-created dataset using several evaluation metrics, including a novel cross-corpus evaluation metric and a robustness evaluation metric. The experimental results indicate that our model achieves substantially better performance than existing methods in terms of accuracy and model complexity.
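To make the two named components concrete, the following is a minimal PyTorch sketch of shifted patch tokenisation (the image concatenated with four half-patch diagonal shifts before patchification) and a local self-attention layer (learnable softmax temperature with the diagonal of the attention map masked). It is illustrative only, not the authors' implementation: the class names, hyper-parameters (patch size 16, embedding dimension 192, 3 heads), and the use of torch.roll for the diagonal shifts are assumptions.

```python
import torch
import torch.nn as nn


class ShiftedPatchTokenization(nn.Module):
    """Concatenate the image with four half-patch diagonal shifts, then patchify and project."""

    def __init__(self, in_ch=3, patch=16, dim=192):
        super().__init__()
        self.patch = patch
        patch_dim = in_ch * 5 * patch * patch          # original image + 4 shifted copies
        self.norm = nn.LayerNorm(patch_dim)
        self.proj = nn.Linear(patch_dim, dim)

    def forward(self, x):                              # x: (B, C, H, W)
        s = self.patch // 2
        shifted = [torch.roll(x, (dy, dx), dims=(2, 3))
                   for dy, dx in ((-s, -s), (-s, s), (s, -s), (s, s))]
        x = torch.cat([x] + shifted, dim=1)            # (B, 5C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(self.norm(x))                 # (B, N, dim) tokens


class LocalSelfAttention(nn.Module):
    """Multi-head self-attention with a learnable temperature and a masked diagonal."""

    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads
        self.temperature = nn.Parameter(torch.tensor(self.head_dim ** -0.5))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]               # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf'))   # suppress self-token attention
        attn = attn.softmax(dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(B, N, -1))


# Toy usage: tokenise a batch of 224x224 images, then apply one local self-attention layer.
tokens = ShiftedPatchTokenization()(torch.randn(2, 3, 224, 224))   # (2, 196, 192)
out = LocalSelfAttention()(tokens)                                  # (2, 196, 192)
```

Both tricks target the abstract's small-data setting: the shifted copies enrich each token with neighbouring spatial context, while the masked diagonal and learnable temperature sharpen the attention distribution when few training samples are available.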