Abstract
Visual description is a challenging task in computer vision. Because it is usually performed on compressed videos, its performance depends strongly on coding distortion. It is therefore important to train visual description networks on video datasets containing both high- and low-quality videos. To generate such data from a given training dataset, this paper introduces a new data augmentation method that employs a transcoder, which converts one video quality into another by controlling the quantization parameter (QP). Two networks are trained on the high- and low-quality videos, respectively, and the proposed deep learning ensemble model then selects the optimum sentence from the candidates generated by these networks. Experimental results show that the proposed method is highly robust to coding distortion.
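The QP-controlled transcoding step described above could be sketched as follows. This is an illustrative example only, assuming FFmpeg with the libx264 encoder; the function and file names are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of QP-controlled transcoding for data augmentation.
# Assumes FFmpeg with libx264 is available; names are illustrative.

def build_transcode_cmd(src, dst, qp):
    """Build an FFmpeg command that re-encodes `src` at a fixed QP.

    A larger QP quantizes more coarsely, producing a lower-quality
    (more coding-distorted) clip for the low-quality training set.
    """
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-qp", str(qp),  # fixed quantization parameter
        "-c:a", "copy",                      # leave the audio stream untouched
        dst,
    ]

# One augmentation pass per source clip: a high-quality copy (low QP)
# and a heavily quantized low-quality copy (high QP).
high_q_cmd = build_transcode_cmd("clip.mp4", "clip_hq.mp4", qp=18)
low_q_cmd = build_transcode_cmd("clip.mp4", "clip_lq.mp4", qp=42)
```

Each command could then be run with `subprocess.run` over the whole training set, yielding the paired high- and low-quality videos on which the two networks are trained.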