Proceedings of the International Workshop on Video and Image Ground Truth in Computer Vision Applications

Concetto Spampinato, Benoît Huet, Bas Boom

doi:10.1145/2501105

Abstract

In the development of computer vision applications, a fundamental role is played by the availability of large datasets of annotated images and videos (ground truth) that cover a wide range of scenarios and environments. Such datasets serve two purposes. First, they are used to train machine-learning approaches, which have been widely and successfully adopted in computer vision but still suffer greatly from the lack of comprehensive, large-scale training data. Second, they are used to evaluate algorithms' performance; the evaluation has to provide enough evidence, to developers and especially to peer scientists reviewing the work, that a method works well in the targeted environment and conditions. The main obstacle to collecting large-scale ground truth is the daunting amount of time and human effort needed to generate high-quality annotations; it has been estimated that labeling a single image may take from two to thirty minutes, depending on the task, and the cost is even higher for videos. Currently, most available datasets and their ground truth are the result of efforts by single research groups that annotated them manually; such datasets, however, are too task-oriented and cannot be generalized. Moreover, the large-scale ground truth gathering approaches that have been tried so far suffer from many limitations, ranging from incomplete or low-quality annotations (due to the lack of quality control) to interoperability issues, since no common representation schema has been adopted yet. In addition, it is not always trivial to identify metrics for performance evaluation. A notable case is object tracking, for which some research groups have developed self-evaluation-based approaches.
Therefore, the availability of massive ground truth would allow the development of such methods and make them, in the long run, independent of ground truth; this would be in line with the current wave of scientific development, which is "data-driven" rather than theory-driven or simulation-driven. The aim of this workshop is to present and report on the most recent methods that support automatic or semi-automatic ground truth annotation and labeling, as well as the evaluation and comparison of algorithms' performance, in applications such as object detection, object recognition, scene segmentation and face recognition, both in still images and in videos. More specifically, the workshop will bring together researchers in computer vision, machine learning and the semantic web to share and collect ideas, with the aim of allowing researchers to model and keep track of the whole research process, from dataset construction to performance evaluation.
