Object detection and tracking is an essential preliminary task in event analysis systems (e.g., visual surveillance). Typically, objects are extracted and tagged, forming representative tracks of their activity. Tagging is usually performed by probabilistic data association; however, in systems capturing disjoint areas it is often not possible to establish such associations, as data may have been collected at different times or in different locations. In this case, appearance matching is a valuable aid. We propose using bag-of-visterms, i.e., a histogram of quantized local feature descriptors, to represent and match tracked objects. This method has proven effective for object matching and classification in image retrieval applications, where descriptors can be extracted a priori. An important difference in event analysis systems is that relevant information is typically restricted to the foreground. Descriptors can therefore be extracted faster, approaching real-time requirements. Also, unlike image retrieval, objects can change over time, so their model needs to be updated continuously. Incremental or adaptive learning is used to tackle this problem. Using independent tracks of 30 different persons, we show that the bag-of-visterms representation effectively discriminates visual object tracks and is highly resilient to incorrect object segmentation. Additionally, this methodology allows the construction of scalable object models that can be used to match tracks across independent views.
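To make the representation concrete, the sketch below shows one way a bag-of-visterms histogram could be built from local descriptors extracted on the foreground region, compared between tracks, and updated incrementally. It is a minimal illustration, not the paper's implementation: the vocabulary is assumed to be a precomputed k-means codebook, and the function names and the smoothing factor `alpha` are hypothetical choices for exposition.

```python
import numpy as np

def bag_of_visterms(descriptors, vocabulary):
    """Quantize local descriptors (N x D) against a visual vocabulary (K x D)
    and return an L1-normalized visterm histogram of length K."""
    # Assign each descriptor to its nearest vocabulary word (Euclidean distance).
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized visterm histograms."""
    return float(np.minimum(h1, h2).sum())

def update_model(model_hist, new_hist, alpha=0.1):
    """Incremental (adaptive) model update: blend the stored track model with
    the histogram from the latest observation. alpha is an assumed parameter."""
    return (1.0 - alpha) * model_hist + alpha * new_hist
```

In use, descriptors would come only from the segmented foreground of each tracked object; a new track is matched against stored models by histogram similarity, and the best-matching model is then refreshed with `update_model` so it follows appearance changes over time.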