Image classification is widely viewed as a fundamental visual recognition task with extensive applications. Traditional approaches typically apply convolutional neural networks (CNNs) to identify the category of an image. However, due to their limited receptive fields, CNN-based methods struggle to model global relations in images, which lowers classification accuracy and makes it difficult to handle complex and diverse image data. To model global relationships, some researchers have applied Transformers to image classification. However, to satisfy the serialization and parallelization requirements of Transformers, images must be divided into equally sized, non-overlapping patches, which breaks the local correlations between adjacent patches and loses the spatiotemporal information of the whole image. Moreover, because Transformers encode little prior knowledge, such models often need to be pre-trained on large-scale datasets, resulting in high computational cost. To simultaneously model the spatiotemporal information between adjacent image patches and fully exploit the global information of the image, this work proposes a novel Spatiotemporal Convolutional Network enhanced Transformer (SCN-Transformer) for image classification. The SCN-Transformer extracts local, global, and spatiotemporal information between adjacent image patches at a lower computational cost. The model comprises three components: the stacked Transformer modules capture the local correlations in the image; the SCN module fuses the local and spatiotemporal information between adjacent image patches and leverages long-range dependencies between different patches to enhance the representational capability of the model's features, allowing the model to learn semantics from different dimensions; and the classification module produces the final prediction. Experiments on the ImageNet-1K dataset demonstrate that the proposed model outperforms existing mainstream image classification methods and achieves a competitive accuracy of 83.7%, confirming the competitiveness of the approach on large-scale image datasets.
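To make the three-component pipeline concrete, the sketch below shows one possible PyTorch layout of patch embedding, stacked Transformer modules, an SCN-style convolutional fusion block, and a classification head. It is a minimal illustration, not the authors' implementation: the patch size, embedding dimension, use of a Conv3d block for the SCN module, residual fusion, and mean pooling are all assumptions made for readability.

```python
import torch
import torch.nn as nn


class SCNTransformer(nn.Module):
    """Illustrative sketch of the abstract's three components:
    stacked Transformer modules, an SCN (spatiotemporal convolution)
    module, and a classification module. All concrete choices are assumed."""

    def __init__(self, img_size=224, patch_size=16, dim=384, depth=6,
                 heads=6, num_classes=1000):
        super().__init__()
        self.grid = img_size // patch_size            # patches per side
        num_patches = self.grid ** 2

        # Patch embedding: split the image into non-overlapping patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

        # Stacked Transformer modules: model correlations between patches.
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=4 * dim,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer,
                                                 num_layers=depth)

        # SCN module (assumed form): a 3D convolution over the patch grid,
        # fusing local/spatiotemporal information between adjacent patches.
        self.scn = nn.Sequential(
            nn.Conv3d(1, 1, kernel_size=3, padding=1),
            nn.BatchNorm3d(1),
            nn.GELU(),
        )

        # Classification module: pooled features -> class logits.
        self.head = nn.Sequential(nn.LayerNorm(dim),
                                  nn.Linear(dim, num_classes))

    def forward(self, x):                             # x: (B, 3, H, W)
        x = self.patch_embed(x)                       # (B, dim, grid, grid)
        x = x.flatten(2).transpose(1, 2)              # (B, N, dim)
        x = self.transformer(x + self.pos_embed)      # token mixing

        # Reshape tokens back onto the patch grid and apply the SCN block.
        B, N, D = x.shape
        x3d = x.transpose(1, 2).reshape(B, 1, D, self.grid, self.grid)
        x3d = self.scn(x3d) + x3d                     # residual fusion (assumed)
        x = x3d.reshape(B, D, N).transpose(1, 2)

        return self.head(x.mean(dim=1))               # mean-pool, then classify


if __name__ == "__main__":
    model = SCNTransformer()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                               # torch.Size([2, 1000])
```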