Abstract

In recent years, the Artificial Intelligence of Things (AIoT) has driven the rapid development of edge computing, and existing video-captioning systems can now be deployed directly on AIoT-enabled cameras (hereafter, edge cameras), whose computing resources are increasingly powerful. We therefore propose a lightweight dense-video-captioning model based on the Transformer framework to improve the efficiency of video-caption generation on edge cameras. To investigate the effect of concept drift on video captioning, we also propose an environment-aware adaptation that allows the system to respond to changes in its environment and produce more accurate captions. Experimental results show that the lightweight dense-video-captioning model increases bilingual evaluation understudy (BLEU) scores by up to 23.5%, reduces computation time by 46.4%, and runs on the edge camera at 27.63 FPS, 4.7% faster than existing approaches. Moreover, the mean average precision of the environment-aware adaptation is up to 11.3% higher than that of existing approaches. In conclusion, the proposed approaches outperform previous ones and are more flexible across different weather scenarios.
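To make the BLEU figures cited above concrete, the short Python sketch below shows how a single generated caption could be scored against a reference caption using NLTK's BLEU implementation. The example captions and the choice of NLTK are illustrative assumptions for exposition only; they are not taken from the paper's evaluation pipeline.

# Minimal sketch (assumed setup): scoring one generated caption against a
# reference with BLEU-4 via NLTK. The caption strings are placeholders,
# not outputs of the proposed model.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a", "car", "drives", "through", "heavy", "rain", "at", "night"]
candidate = ["a", "car", "is", "driving", "in", "the", "rain", "at", "night"]

# Apply smoothing because short captions often have zero higher-order
# n-gram matches, which would otherwise force the score to zero.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")

In practice, dense-video-captioning systems report BLEU averaged over many event-level captions; the single-sentence call above simply illustrates what the metric compares (n-gram overlap between a generated caption and a reference).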
