Abstract
Visual description is a challenging task in computer vision. Because it is usually performed on compressed videos, its performance depends strongly on coding distortion. It is therefore important to train visual description networks on video datasets containing both high- and low-quality videos. To generate such data from a given training dataset, this paper introduces a new data augmentation method that employs a transcoder, which converts one video quality into another by controlling the quantization parameter (QP). Two networks are trained on the high- and low-quality videos, respectively, and the proposed deep learning ensemble model then selects the optimum sentence from the candidates generated by these networks. Experimental results show that the proposed method is highly robust to coding distortion.
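The QP-controlled transcoding step described above could be sketched as follows. This is an illustrative example only, assuming FFmpeg with the libx264 encoder; the function and file names are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of QP-controlled transcoding for data augmentation.
# Assumes FFmpeg with libx264 is available; names are illustrative.

def build_transcode_cmd(src, dst, qp):
    """Build an FFmpeg command that re-encodes `src` at a fixed QP.

    A larger QP quantizes more coarsely, producing a lower-quality
    (more coding-distorted) clip for the low-quality training set.
    """
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-qp", str(qp),  # fixed quantization parameter
        "-c:a", "copy",                      # leave the audio stream untouched
        dst,
    ]

# One augmentation pass per source clip: a high-quality copy (low QP)
# and a heavily quantized low-quality copy (high QP).
high_q_cmd = build_transcode_cmd("clip.mp4", "clip_hq.mp4", qp=18)
low_q_cmd = build_transcode_cmd("clip.mp4", "clip_lq.mp4", qp=42)
```

Each command could then be run with `subprocess.run` over the whole training set, yielding the paired high- and low-quality videos on which the two networks are trained.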