Compared with the progress made on human activity classification, much less success has been achieved on human interaction understanding (HIU). Apart from the fact that the latter task is much more challenging, the main cause is that recent approaches learn human interactive relations via shallow graphical representations, which are inadequate for modeling complicated human interactive relations. This paper proposes a deep consistency-aware framework aimed at tackling the grouping and labelling inconsistencies in HIU. The framework consists of three components: a backbone CNN to extract image features, a factor graph network to implicitly learn higher-order consistencies among labelling and grouping variables, and a consistency-aware reasoning module to explicitly enforce consistencies. The last module is inspired by our key observation that the consistency-aware reasoning bias can be embedded into an energy function or a particular loss function, the minimization of which delivers consistent predictions. An efficient mean-field inference algorithm is proposed, so that all modules of our network can be trained in an end-to-end fashion. Experimental results demonstrate that the two proposed consistency-learning modules complement each other, and both contribute considerably to achieving leading performance on three HIU benchmarks. The effectiveness of the proposed approach is further validated by experiments on detecting human-object interactions.