Abstract
Local visual pattern modelling was one of the most important problems in the era of hand-crafted features, where the aim was to bridge the gap between a set of local visual representations and high-level vision tasks such as image and video classification. Recent years have witnessed the overwhelming success of deep Convolutional Neural Networks (CNNs), which poses new challenges and open problems for local pattern modelling. This thesis focuses on three problems: 1) How can the CNN architecture be leveraged to model local patterns effectively? 2) How can the merits of supervised coding and Fisher vector coding, two state-of-the-art coding schemes, be combined to model high-dimensional local representations, e.g., CNN descriptors, and achieve better classification performance? 3) How can we discard the strong assumption that local patterns are independent and identically distributed (i.i.d.) and model the dependencies between region-level CNN features for high-level tasks? For the first problem, we propose two CNN architectures to model the local features of videos for action recognition. The first work focuses on designing a network that effectively encodes and aggregates multiple kinds of local features of a video, while the second proposes a convolutional pooling strategy to exploit the temporal information hidden within frame-level representations. Both works yield flexible CNN architectures that are compatible with the video format and lead to promising action recognition performance. Supervised encoding and Fisher vector encoding are two representative schemes for creating image representations. Both achieve state-of-the-art image classification performance, but through different strategies: the former extracts discriminative patterns from local features at the encoding stage, while the latter preserves rich information in high-dimensional signatures derived from a generative model of local features.
For the second problem, we propose a hybrid Fisher vector encoding scheme for image classification that combines the strategies of supervised encoding and Fisher vector encoding. The key idea is to use supervised encoding to decompose local features into a discriminative part and a residual part, and then to build a generative model based on this decomposition. For the third problem, we study the challenging task of identifying unusual instances of known objects in images within an “open-world” setting. That is, we aim to find objects that are members of a known class but are not typical of that class. We propose to identify unusual objects by inspecting the distributions of local visual patterns at multiple image regions. Considering the promising performance of Region CNN [37], we represent an image by a set of local CNN features and then map them to scalar detection scores to remove the distracting influence of irrelevant content. To model the region-level score distribution, we use Gaussian Processes (GPs) to construct two separate generative models, one for “regular object” and the other for “other objects”. We design a new covariance function that simultaneously models the detection score at a single location and the score dependencies between multiple regions. This treatment allows our method to capture the spatial dependencies between local regions, which turns out to be crucial for identifying unusual objects.