Compositional Zero-shot Learning (CZSL) attempts to recognise images of new compositions of states and objects when images of only a subset of state-object compositions are available as training data. An example of CZSL is recognising images of a peeled apple with a model trained only on images of peeled orange, ripe apple and ripe orange. There are two major challenges in solving CZSL. First, the visual features of a state vary depending on the context of the state-object composition. For example, a state like ripe produces distinct visual properties in the compositions ripe orange and ripe banana. Hence, understanding the context dependency of state features is a necessary requirement for solving CZSL. Second, the extent of association between the features of a state and an object varies significantly across different images of the same composition. For example, in different images of peeled oranges, the oranges may be peeled to different extents. As a consequence, the visual features of images of the class peeled orange may vary, and there exists significant intra-class variability among the visual features of different images of a composition. Existing approaches merely look for the presence or absence of the features of a particular state or object in a composition. Our approach looks not only for the presence of particular state or object features but also for the extent of association between state features and object features, to better tackle the intra-class variability in the visual features of compositional images. The proposed architecture is built around a novel Knowledge Guided Transformer; the transformer-based framework is utilised to capture the context dependency between the state and the object. Extensive experiments on the C-GQA, MIT-States and UT-Zappos50k datasets demonstrate the superiority of the proposed approach over the state of the art in both open-world and closed-world CZSL settings.
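To make the compositional scoring idea concrete, below is a minimal sketch of how a transformer encoder can jointly embed a state-object pair so that the state representation adapts to its object context, and how the resulting composition embedding can be scored against an image feature. This is a generic illustration under assumed names and dimensions (StateObjectScorer, embed_dim, a frozen image backbone producing a global feature); it is not the paper's actual Knowledge Guided Transformer architecture.

```python
# Hypothetical minimal sketch of transformer-based composition scoring for CZSL.
# All module names and hyperparameters are illustrative assumptions, not the
# paper's Knowledge Guided Transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateObjectScorer(nn.Module):
    def __init__(self, num_states, num_objects, embed_dim=256, num_heads=4):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, embed_dim)
        self.object_emb = nn.Embedding(num_objects, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_feat, state_ids, object_ids):
        # image_feat: (B, D) global image feature from a frozen visual backbone
        # state_ids, object_ids: (P,) indices of P candidate compositions
        s = self.state_emb(state_ids)        # (P, D)
        o = self.object_emb(object_ids)      # (P, D)
        pair = torch.stack([s, o], dim=1)    # (P, 2, D) two-token sequence
        # Self-attention lets the state token attend to its object context,
        # so "ripe" can embed differently for "orange" vs "banana".
        ctx = self.encoder(pair)             # (P, 2, D)
        comp = ctx.mean(dim=1)               # (P, D) composition embedding
        # Cosine compatibility between each image and each candidate composition
        img = F.normalize(image_feat, dim=-1)
        comp = F.normalize(comp, dim=-1)
        return img @ comp.t()                # (B, P) compatibility scores

scorer = StateObjectScorer(num_states=5, num_objects=7)
scores = scorer(torch.randn(2, 256),
                torch.tensor([0, 1, 2]), torch.tensor([3, 3, 4]))
print(scores.shape)  # torch.Size([2, 3])
```

At test time, such a scorer can rank all candidate compositions (including unseen ones) for a given image, which is the standard CZSL inference setup.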