By bridging uncompressed-domain saliency detection and MPEG-4 AVC compression principles, the present paper advances a methodological framework for extracting saliency maps directly from the stream syntax elements. In this respect, inside each GOP, the intensity, color, orientation and motion elementary saliency maps are related to the energy of the luma coefficients, the energy of the chroma coefficients, the gradient of the prediction modes and the amplitude of the motion vectors, respectively. The three spatial saliency maps are pooled according to an average formula, while the static-temporal fusion is achieved by six different formulas. The experiments consider both ground-truth and applicative evaluations. The ground-truth benchmarking investigates the relation between the predicted MPEG-4 AVC saliency map and the actual human saliency, captured by eye-tracking devices. It is based on two corpora (representing fixation density maps and saccade locations), two objective criteria (the closeness between the predicted and the real saliency maps, and the difference between the behavior of the predicted saliency map in fixation and random locations), two objective measures (KLD – the Kullback-Leibler Divergence and AUC – the Area Under the ROC Curve) and five state-of-the-art saliency models (three acting in the spatial domain and two in the compressed domain). The applicative validation is carried out by integrating the MPEG-4 AVC saliency map into a robust watermarking application. As an overall conclusion, the paper demonstrates that although the MPEG-4 AVC standard does not explicitly rely on any visual saliency principle, its stream syntax elements preserve saliency information. Four main benefits of the MPEG-4 AVC based saliency extraction are thus brought to light: (1) it outperforms (or is at least as good as) state-of-the-art uncompressed-domain methods, (2) it allows significant gains in watermarking transparency (for a prescribed data payload and robustness), (3) it is less sensitive to the randomness in the processed visual content, and (4) it has linear computational complexity. For instance, the ground-truth results exhibit absolute relative gains between 60% and 164% in KLD and between 17% and 21% in AUC, as well as relative gains in KLD sensitivity between 1.18 and 6.12 and in AUC sensitivity between 1.06 and 33.7; the applicative validation brings to light transparency gains of up to 10 dB in PSNR.
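The per-GOP pooling and static-temporal fusion outlined above can be illustrated by a minimal sketch; the exact normalization and the six fusion formulas are not given in the abstract, so the choices below (per-macroblock maps normalized to [0, 1], a mean and a product fusion, and the helper names) are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of the per-GOP saliency pooling/fusion described in the
# abstract. All maps are assumed to be per-macroblock arrays of the same
# shape, already normalized to [0, 1]; the six static-temporal fusion
# formulas are not specified here, so "mean" and "product" are placeholders.

def spatial_saliency(luma_energy, chroma_energy, mode_gradient):
    """Average pooling of the intensity, color and orientation maps."""
    return (luma_energy + chroma_energy + mode_gradient) / 3.0

def fuse_static_temporal(spatial_map, motion_map, mode="mean"):
    """Combine the pooled spatial map with the motion (temporal) map."""
    if mode == "mean":
        return 0.5 * (spatial_map + motion_map)
    if mode == "product":
        return spatial_map * motion_map
    raise ValueError(f"unknown fusion mode: {mode}")

# Example on random per-macroblock maps (e.g. a 45 x 80 macroblock grid).
rng = np.random.default_rng(0)
maps = [rng.random((45, 80)) for _ in range(4)]
spatial = spatial_saliency(*maps[:3])
saliency = fuse_static_temporal(spatial, maps[3], mode="product")
```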