Abstract
This thesis focuses on combining information at different levels of a statistical pattern classification framework for image annotation and retrieval. Previous studies in image annotation and retrieval have established that low-level visual features, such as color and texture, and high-level features, such as textual description and context, are distinct yet complementary in their distributions and in their discriminative power for machine-based recognition and retrieval tasks. Effective feature combination has therefore become a promising way to further bridge the semantic gap. Motivated by this, the thesis addresses two problems, the combination of the visual and context modalities and the combination of different features within the visual domain, by developing two statistical pattern classification approaches, because features within the visual modality and features across modalities exhibit different degrees of heterogeneity and should be treated differently. For cross-modality feature combination, a Bayesian framework is proposed to integrate visual content and context, and it is applied to various image annotation and retrieval frameworks. For the combination of different low-level features in the visual domain, a novel method combines texture and color features via a mixture model of their joint distribution. The proposed frameworks are evaluated on several datasets: the COREL database for image retrieval, and MSRC, LabelMe, PASCAL VOC2009, and an animal image database that we collected for image annotation. Under various evaluation criteria, the first framework is shown to be more effective than methods based purely on low-level features or on high-level context. For the second, the experimental results demonstrate both its superior performance over other feature combination methods and its ability to discover visual clusters using texture and color simultaneously. Moreover, a demo search engine based on the Bayesian framework has been implemented and is available online.
Highlights
1.1 Background. The continual growth of multimedia information has been witnessed and experienced by human beings since the beginning of the information age.
The underlying rationale of the integration is that the online observation of visual content refines the a priori information encoded in the context model, especially when high-level knowledge is insufficient, whereas the contextual information can be used to bridge, to some extent, the semantic gap associated with the low-level visual features.
Because the two-stage image segmentation indicates whether a segment contains a foreground or a background concept, we consider two types of classification/annotation for each of the above three approaches in order to verify the improvement resulting from the two-stage image segmentation.
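The integration rationale stated in the highlights above, in which observed visual content refines a context-derived prior, can be illustrated with a minimal sketch. It assumes the posterior over concepts is proportional to a content likelihood times a context prior; the function name `map_label`, the concept names, and all probability values are hypothetical and are not taken from the thesis.

```python
def map_label(likelihood, prior):
    """Combine a content likelihood with a context prior via Bayes' rule.

    likelihood[k] stands in for P(x | w_k) from a visual content model.
    prior[k] stands in for the a priori probability of concept w_k
    supplied by a context model. Both are illustrative assumptions.
    Returns the index of the MAP concept and the normalized posterior.
    """
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    posterior = [u / total for u in unnormalized]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior


# Hypothetical three-concept example with made-up numbers.
concepts = ["cow", "grass", "sky"]
likelihood = [0.30, 0.40, 0.30]  # visual evidence alone slightly favors "grass"
prior = [0.60, 0.30, 0.10]       # context favors "cow"

best, posterior = map_label(likelihood, prior)
print(concepts[best], posterior)
```

In this toy setting the visual evidence alone would select a different concept than the combined rule does, which is the kind of correction that contextual information is meant to provide.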
Summary
1.1 Background. The continual growth of multimedia information has been witnessed and experienced by human beings since the beginning of the information age. An approach is desired that can utilize both STRF and LTRF in a unified framework and that is based on content as well as context. Motivated by such goals, a Bayesian framework is developed in which the a priori probability, learned through a maximum entropy algorithm, represents the contextual information. The general Bayesian framework presented in the previous chapter is employed, in which the contextual information is induced from the characteristic audio features of different objects, while their visual features are the input to the content analysis. To address the problem of combining low-level visual descriptors for image annotation, a new generative framework is proposed and used within the supervised classification paradigm. It combines different visual features by jointly modeling the descriptors extracted from the same salient point location of an image, with their conditional distributions constrained via a single latent variable. Denoting the two groups by $W_F$ and $W_B$, the decision rule in (3.3) can be rewritten as $\arg\max_{\omega \in W} P(\omega \mid x, I)$.
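The joint modeling of descriptors mentioned in this summary can be written, under one plausible reading, as a mixture in which a single latent variable ties the two descriptor types together. The symbols below are assumptions introduced for illustration and are not the thesis's notation: $t$ and $c$ are the texture and color descriptors extracted at the same salient point, $z$ is the latent component index, and $K$ is the number of components. The sketch further assumes that the two descriptors are conditionally independent given $z$.

```latex
p(t, c) \;=\; \sum_{z=1}^{K} P(z)\, p(t \mid z)\, p(c \mid z),
\qquad \sum_{z=1}^{K} P(z) = 1 .
```

Under this reading, each component $z$ would correspond to a visual cluster characterized by texture and color simultaneously, which matches the clustering ability described in the abstract.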