Abstract
Image captioning is an important task at the intersection of natural language processing (NLP) and computer vision (CV). Current captioning models are of sufficient quality for practical use, but they demand both substantial computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model-size compression in order to prepare such models for use on mobile devices. Furthermore, these works usually investigate only decoder compression in the typical encoder–decoder architecture, even though the encoder traditionally occupies most of the space. We applied efficient model-compression techniques, such as architectural changes, pruning, and quantization, to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including the encoder), while losing no more than 2% and 4.5% on the CIDEr and SPICE metrics, respectively. At the same time, the best model achieved 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB, which sets a strong baseline for compressing image-captioning models and is suitable for practical applications.
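As a rough illustration of two of the compression techniques named above, the following sketch applies magnitude pruning and dynamic int8 quantization to a single linear layer using PyTorch's built-in utilities. The layer, its dimensions, and the sparsity level are hypothetical stand-ins chosen for illustration; they are not the paper's actual models or settings.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for one linear layer of a captioning decoder.
layer = nn.Linear(512, 512)

# Magnitude pruning: zero out the 80% of weights with the smallest |w|
# (an illustrative sparsity level, not taken from the paper).
prune.l1_unstructured(layer, name="weight", amount=0.8)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor

# Dynamic quantization: store Linear weights as int8 instead of fp32,
# which alone cuts their storage roughly fourfold.
model = nn.Sequential(layer)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])

Note that pruning in PyTorch only zeroes weights in place; the zeros reduce file size only if the tensors are subsequently stored in a sparse or compressed format, whereas quantization shrinks the stored weights directly.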
Highlights
Image captioning is one of the most significant tasks combining two domains, computer vision (CV) and natural language processing (NLP) [1].
More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have been created, applying transformers both to sentences [15,16,17,18] and to images [19].
For the AoANet model, the size reduction was 95.6%, from 791.8 MB to 34.8 MB, while the CIDEr and SPICE metrics fell by 1.7% and 4%, from 129.8 to 127.6 and from 22.4 to 21.5, respectively.
Summary
Image captioning is one of the most significant tasks combining two domains, CV and NLP [1]. A caption should list the objects in the image while taking into account their attributes and the interactions between them, so that the description is as humanlike as possible. Image-captioning models are based on the encoder–decoder architecture. More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have been created, applying transformers both to sentences [15,16,17,18] and to images [19].
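To make the encoder–decoder pattern concrete, here is a minimal, illustrative captioning skeleton in PyTorch: a convolutional patch encoder stands in for a pretrained image encoder, and a small transformer decoder attends over its features to predict caption tokens. Every name and dimension is an assumption made for illustration; none is taken from the cited models.

import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Encoder: a small conv stack standing in for a pretrained CNN/ViT.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                     # (B, d, N)
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        memory = self.encoder(images).transpose(1, 2)  # (B, N, d)
        tgt = self.embed(tokens)                       # (B, T, d)
        out = self.decoder(tgt, memory)                # cross-attend to image
        return self.head(out)                          # (B, T, vocab)

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

In this layout the encoder dominates the parameter count once a realistic pretrained backbone is substituted, which is why the paper argues that compressing only the decoder leaves most of the model size untouched.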