Taking a photo is easier than ever with smartphones; however, even state-of-the-art phones cannot yet describe the content of the photos they capture. In this paper, we propose a discriminative stochastic algorithm for image content annotation and description refinement. We first segment the images into regions and then cluster the regions into visual blobs, whose number is smaller than the total number of training image regions. Each visual blob is regarded as a key visual word. Given a training image set with annotations, we find that the annotation process is conditioned on the selection sequence of both the semantic description words and the key visual words. The process can be described as a Markov chain whose transitions run between the set of candidate annotations and the set of visual words. Experimental results demonstrate that the proposed annotation algorithm outperforms traditional statistics-based methods.
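The pipeline summarized above (regions, visual blobs, and a chain that alternates between annotation words and visual words) can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the clustering method or how the transition probabilities are defined. The sketch assumes plain k-means for building the blobs, transition probabilities estimated from word/blob co-occurrence counts in the training set, and a restart weight that keeps the walk tied to the query image. All function names and parameters (`kmeans`, `cooccurrence`, `annotate`, `steps`, `restart`) are hypothetical.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(features), size=k, replace=False)
    centroids = features[picks].astype(float)
    for _ in range(iters):
        # Squared distance from every region to every centroid.
        d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d.argmin(axis=1)


def cooccurrence(image_blobs, image_words, n_blobs, n_words):
    """Count how often annotation word w appears in an image containing blob b."""
    counts = np.zeros((n_words, n_blobs))
    for blobs, words in zip(image_blobs, image_words):
        for w in words:
            for b in blobs:
                counts[w, b] += 1
    return counts


def row_normalize(m):
    """Turn each row into a probability distribution; all-zero rows stay zero."""
    s = m.sum(axis=1, keepdims=True)
    return np.divide(m, s, out=np.zeros_like(m, dtype=float), where=s > 0)


def annotate(query_blobs, counts, steps=2, restart=0.5):
    """Score annotation words by a short walk alternating blobs and words."""
    n_words, n_blobs = counts.shape
    p_word_to_blob = row_normalize(counts)
    p_blob_to_word = row_normalize(counts.T)
    start = np.bincount(query_blobs, minlength=n_blobs).astype(float)
    start /= start.sum()
    blob_dist = start
    for _ in range(steps):
        word_dist = blob_dist @ p_blob_to_word
        # Assumed restart term: mix the walk with the query's own blobs.
        blob_dist = restart * start + (1 - restart) * (word_dist @ p_word_to_blob)
    return blob_dist @ p_blob_to_word


# Toy usage: four region feature vectors from two training images.
features = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
_, labels = kmeans(features, k=2)
image_blobs = [labels[:2], labels[2:]]   # regions 0-1: image 0, regions 2-3: image 1
image_words = [[0], [1]]                 # word ids, e.g. 0 = "sky", 1 = "grass"
counts = cooccurrence(image_blobs, image_words, n_blobs=2, n_words=2)
scores = annotate(np.array([labels[0]]), counts)
ranking = np.argsort(-scores)            # candidate annotations, best first
```

Alternating between the word set and the blob set lets a candidate annotation gain score through visual words it shares with other annotations, which is one plausible reading of the refinement step; the actual transition model is defined in the full paper.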