Abstract

Visual description is a challenging task in computer vision. Since it is usually performed on compressed videos, its performance depends strongly on coding distortion. It is therefore important to train visual description networks on datasets containing both high- and low-quality videos. To generate such data from a given training dataset, this paper introduces a data augmentation method that employs a transcoder to convert videos from one quality to another by controlling the quantization parameter (QP). Two networks are trained on the high- and low-quality videos, respectively, and the proposed deep-learning ensemble model then selects the optimum sentence among the candidates generated by these networks. Experimental results show that the proposed method is robust to coding distortion.
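The abstract does not specify the transcoder, but the QP-controlled augmentation step it describes can be sketched as follows. This is a minimal illustration assuming an H.264 transcoder (ffmpeg with libx264); the directory names and QP values are hypothetical, not taken from the paper.

```python
import subprocess
from pathlib import Path

def transcode_with_qp(src: Path, dst: Path, qp: int) -> None:
    """Re-encode a video at a fixed quantization parameter via ffmpeg/libx264."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", str(src),
            "-c:v", "libx264",
            "-qp", str(qp),  # constant-QP mode: a larger QP yields stronger coding distortion
            "-an",           # drop audio; the description networks consume frames only
            str(dst),
        ],
        check=True,
    )

# Augment a training set with low-quality counterparts of each clip.
out_dir = Path("train_videos_lq")
out_dir.mkdir(exist_ok=True)
for clip in Path("train_videos").glob("*.mp4"):
    for qp in (27, 37, 47):  # illustrative QP values, not the paper's settings
        transcode_with_qp(clip, out_dir / f"{clip.stem}_qp{qp}.mp4", qp)
```

Each original clip then has several degraded counterparts, so one network can be trained on the originals and another on the transcoded low-quality set, as the abstract describes.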
