Abstract

Recognition of human activities is an essential field in computer vision, and much of human activity consists of human-object interaction (HOI). Many successful works on HOI recognition have achieved acceptable results, but they are fully supervised and require labeled training data for all HOIs. The space of possible human-object interactions is huge, and listing and providing training data for every category is costly and impractical. We tackle this problem by proposing an approach that scales human-object interaction recognition in video data through zero-shot learning. Our method recognizes a verb and an object from a video and combines them into an HOI class. Recognizing verbs and objects instead of whole HOIs allows a new combination of verb and object to be identified as a new HOI class that the recognizer model has never seen. We introduce a neural network architecture that can understand video data. The proposed model learns verbs and objects from the available training data during the training phase; at test time it detects verb-object pairs in a video and thereby identifies the HOI class. We evaluated our model on the recently introduced Charades dataset, which contains many HOI categories in videos. We show that our model detects unseen HOI classes in addition to acceptably recognizing seen ones, so the number of identifiable categories is significantly larger than the number of training classes.
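The compositional idea above, that independent verb and object predictions can be combined into an HOI class even when that pair was never annotated jointly, can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the function name, label vocabularies, and the product-of-probabilities scoring rule are all assumptions made for the example.

```python
# Hypothetical sketch of verb-object factorization for zero-shot HOI
# recognition: two classifiers score verbs and objects independently,
# and every (verb, object) pair -- seen or unseen during training --
# becomes a candidate HOI class.

from itertools import product

def compose_hoi(verb_probs, object_probs):
    """Score every verb-object pair as the product of its independent
    verb and object probabilities, and return the best-scoring pair
    as the predicted HOI class."""
    scores = {
        (verb, obj): p_verb * p_obj
        for (verb, p_verb), (obj, p_obj) in product(verb_probs.items(),
                                                    object_probs.items())
    }
    return max(scores, key=scores.get)

# Example: even if ("drink", "cup") never appeared as a labeled HOI,
# the composed pair is still identifiable from its parts.
verb_probs = {"drink": 0.7, "hold": 0.2, "throw": 0.1}
object_probs = {"cup": 0.6, "ball": 0.3, "phone": 0.1}
print(compose_hoi(verb_probs, object_probs))  # ('drink', 'cup')
```

With a vocabulary of V verbs and O objects, this composition makes V x O HOI classes identifiable while only V + O concepts need labeled training data, which is the scaling benefit the abstract claims.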
