Codebooks are a widely accepted technique to recognise objects by sets of local features. The method has been applied to many classes of objects, even very abstract ones. But although state of the art recognition rates have been reported, the method is still far away from being reliable in any sense that is related to human vision. The literature on this topic emphasises detailed descriptions of statistical estimators over a basic analysis of the data. A deeper understanding of the data is however needed to achieve a further development of the field. In this paper, we therefore present a set of quantitative experiments on codebooks of the popular SIFT descriptors. The results discourage the use of illustrative but overly simplifying descriptions of the visual words approach. It is in particular demonstrated that (1) there are more visually distinct patterns than can be listed in a codebook, (2) one element of a codebook represents a set of many, visually distinct patterns, and (3) there are no single, selective SIFT descriptors to serve as codebook elements. This makes us wonder why the method works after all. We discuss several options.
Read full abstract