Human-action recognition models are neural networks that analyse visual inputs and produce classification or text outputs. This technology has had, and will continue to have, a significant impact on society in domains such as security, education, and healthcare. However, human-action recognition models, like other neural networks, remain susceptible to malicious adversarial attacks. Therefore, this paper proposes an experimental adversarial attack against ResNet-18 using the Fast Gradient Sign Method (FGSM). First, ResNet-18 is fine-tuned on the UCF-101 dataset, and keyframes are selected from sample videos. The keyframes are classified by ResNet-18, FGSM is then applied to generate adversarial versions, and ResNet-18 classifies the attacked samples a second time. In the unimodal setting, the classification results (original and attacked) are given to a language model (GPT-4o) through a prompt that assigns the model a specific role (e.g., a smart home assistant). In the multimodal setting, the original and attacked frames themselves are sent directly instead of the labels. Lastly, this paper proposes to observe the effects on the textual responses generated from the given prompt and the classification results, and to evaluate the impact of the attack through cosine similarity and human evaluation.
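To make the attack stage concrete, the sketch below illustrates the general FGSM procedure applied to a single keyframe with a ResNet-18 classifier, assuming a PyTorch implementation. The class count (101 for UCF-101), the perturbation budget `epsilon`, and the dummy keyframe tensor are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def fgsm_attack(model, frame, label, epsilon=0.03):
    """Perturb a keyframe in the direction of the sign of the loss gradient."""
    frame = frame.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(frame), label)
    model.zero_grad()
    loss.backward()
    # FGSM: x_adv = x + epsilon * sign(grad_x loss)
    adv_frame = frame + epsilon * frame.grad.sign()
    return adv_frame.clamp(0, 1).detach()

# ResNet-18 with its final layer resized for the 101 UCF-101 action classes
# (fine-tuning itself is omitted here).
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 101)
model.eval()

# A random tensor stands in for a keyframe extracted from a UCF-101 clip.
keyframe = torch.rand(1, 3, 224, 224)
true_label = torch.tensor([7])

adv_keyframe = fgsm_attack(model, keyframe, true_label)
original_pred = model(keyframe).argmax(dim=1).item()
attacked_pred = model(adv_keyframe).argmax(dim=1).item()
print(original_pred, attacked_pred)
```

The two predicted labels (original and attacked) are what the unimodal prompt would pass to the language model; in the multimodal case, `keyframe` and `adv_keyframe` would be sent as images instead.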