Abstract. Optimizing and understanding the computational accuracy of vision systems is crucial as they become increasingly integrated into daily life. Dynamic Vision Sensors (DVS) and Vision Transformers (ViTs) are at the forefront of computer vision, offering efficient object recognition and image processing. However, the high computational complexity of DVS data poses a problem for real-time implementations. Merging these technologies can enhance vision system performance in dynamic environments and optimize real-time DVS processing. In this work, we use a ViT architecture to classify the DVS128 dataset and compare our results with existing works based on spiking neural networks (SNNs). We analyze how our method affects accuracy and loss by experimenting with different DVS-to-ViT input patch sizes. Our results show that the larger 32x32-pixel patches achieve higher accuracy and lower loss than 4x4-pixel patches as training epochs increase. Our method also reaches 98.4% accuracy with a loss of 0.22 within five epochs, significantly outperforming previous works that average 93.13% accuracy over more epochs. These results highlight the strong potential of ViTs for real-time DVS data classification in applications that require high accuracy, such as autonomous vehicles and surveillance systems.
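As a rough illustration of the patch-size comparison described above (not the authors' published pipeline), the sketch below assumes DVS events have already been accumulated into 2-channel 128x128 frames and shows how a ViT-style patch embedding with 32x32 versus 4x4 patches changes the number of input tokens; all module names, channel counts, and embedding dimensions are illustrative assumptions.

```python
# Minimal sketch, assuming DVS events are binned into 2-channel (ON/OFF) 128x128 frames.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an event frame into (patch_size x patch_size) patches and
    linearly project each patch to an embedding vector (standard ViT front end)."""
    def __init__(self, img_size=128, patch_size=32, in_channels=2, embed_dim=192):
        super().__init__()
        assert img_size % patch_size == 0, "frame must divide evenly into patches"
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the usual way to implement ViT patch embedding.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, 128, 128) event frame
        x = self.proj(x)                        # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim)

# 32x32 patches yield 16 tokens per frame; 4x4 patches yield 1024 tokens.
frames = torch.randn(8, 2, 128, 128)            # dummy batch of event frames
print(PatchEmbed(patch_size=32)(frames).shape)  # torch.Size([8, 16, 192])
print(PatchEmbed(patch_size=4)(frames).shape)   # torch.Size([8, 1024, 192])
```

Larger patches reduce the token count quadratically, which lowers the cost of self-attention; the abstract's finding that 32x32 patches also reach higher accuracy within five epochs suggests this trade-off favors coarse patches for this task.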