Incorporating part-whole hierarchies into fully convolutional network for scene parsing

Karim Abbasi, Parvin Razzaghi

doi:10.1016/j.eswa.2020.113662

Abstract

This paper proposes a new approach to scene parsing that integrates part-whole hierarchical relationships into the last feature map to assign a semantic class label to each pixel. Deep learning-based approaches have recently had a great impact on scene parsing. However, these methods do not preserve the spatial information of high-level (or mid-level) features. To address this, Hinton, one of the fathers of deep learning, introduced the capsule concept to encode pose information such as orientation. All capsules with similar pose matrix values are grouped to form a parent capsule. However, that work faces two challenges: 1) the dynamic routing-by-agreement procedure used to obtain the routing coefficients is time-consuming, and 2) the variation in appearance and the spatial hierarchies between part capsules and their corresponding parent are not encoded. To address these challenges, this study uses the generalized Hough transform (GHT) and the tensor normal distribution to propose a novel capsule concept. Each capsule has k offset vectors for each semantic class. The offset vectors point from the capsule to the k other capsules that play an effective role in assigning that capsule to a specific semantic class. The problem is formulated so that evaluating the approach on large datasets is feasible. In addition, a new score function is designed to accumulate the strengths of the votes for capsule class estimation. To do so, we use a tensor normal distribution whose covariance matrix is defined as the Kronecker product of the capsule feature covariance and the between-capsule covariance. The proposed approach is the first to encode, through the between-capsule covariance matrix, the relations between the part capsules that vote for a whole capsule. The approach is evaluated on the SiftFlow, NYUD-v2 and PASCAL VOC 2012 datasets. The results show that our approach achieves superior performance.
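
For readers unfamiliar with a Kronecker-structured covariance, the two-mode (matrix normal) case can be sketched as follows. This is the standard textbook density, not a formula quoted from the paper. The symbols are assumptions introduced here for illustration: X is an n × p matrix whose rows are the feature vectors of n capsules, M is its mean, U is an n × n between-capsule covariance, and V is a p × p capsule feature covariance. The order of the Kronecker factors depends on the vectorization convention; column stacking is assumed.

```latex
% Matrix normal density (two-mode special case of the tensor normal).
% Assumed symbols: X, M in R^{n x p}; U in R^{n x n}; V in R^{p x p}.
\[
p(X \mid M, U, V)
  = \frac{\exp\!\left(-\tfrac{1}{2}\,
      \operatorname{tr}\!\left[V^{-1}(X-M)^{\top}U^{-1}(X-M)\right]\right)}
         {(2\pi)^{np/2}\,\lvert V\rvert^{n/2}\,\lvert U\rvert^{p/2}}
\]
% Equivalent vectorized form with column-stacking vec(.):
\[
\operatorname{vec}(X) \sim
  \mathcal{N}\!\left(\operatorname{vec}(M),\; V \otimes U\right)
\]
```

Under this factorization the full np × np covariance never has to be formed or inverted explicitly; only the n × n and p × p factors are needed, which is one plausible reason such a structure suits large datasets.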

Full Text