Abstract
Image captioning is an important task at the intersection of natural language processing (NLP) and computer vision (CV). Current captioning models are of sufficient quality for practical use, but they demand both substantial computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model-size compression in order to prepare such models for use on mobile devices. Furthermore, these works usually investigate only decoder compression in the typical encoder–decoder architecture, even though the encoder traditionally occupies most of the space. We applied efficient model-compression techniques, such as architectural changes, pruning, and quantization, to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including the encoder), while losing no more than 2% and 4.5% on the CIDEr and SPICE metrics, respectively. At the same time, the best model achieved 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB, which sets a strong baseline for compressing image-captioning models and is suitable for practical applications.
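As a rough illustration of two of the compression techniques named above, the following sketch applies magnitude pruning and dynamic int8 quantization to a single linear layer using PyTorch's built-in utilities. The layer, its dimensions, and the sparsity level are hypothetical stand-ins chosen for illustration; they are not the paper's actual models or settings.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for one linear layer of a captioning decoder.
layer = nn.Linear(512, 512)

# Magnitude pruning: zero out the 80% of weights with the smallest |w|
# (an illustrative sparsity level, not taken from the paper).
prune.l1_unstructured(layer, name="weight", amount=0.8)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor

# Dynamic quantization: store Linear weights as int8 instead of fp32,
# which alone cuts their storage roughly fourfold.
model = nn.Sequential(layer)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])

Note that pruning in PyTorch only zeroes weights in place; the zeros reduce file size only if the tensors are subsequently stored in a sparse or compressed format, whereas quantization shrinks the stored weights directly.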
Highlights
Image captioning is one of the most significant tasks combining two domains, computer vision (CV) and natural language processing (NLP) [1].
More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have been created, applying transformers both to sentences [15,16,17,18] and to images [19].
For the AoANet model, the size reduction was 95.6%, from 791.8 MB to 34.8 MB, while the CIDEr and SPICE metrics fell by 1.7% and 4%, from 129.8 to 127.6 and from 22.4 to 21.5, respectively.
Summary
Image captioning is one of the most significant tasks combining two domains, CV and NLP [1]. A caption should list the objects in the image while taking into account their attributes and the interactions between them, so that the description is as humanlike as possible. Image-captioning models are based on the encoder–decoder architecture. More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have been created, applying transformers both to sentences [15,16,17,18] and to images [19].
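To make the encoder–decoder pattern concrete, here is a minimal, illustrative captioning skeleton in PyTorch: a convolutional patch encoder stands in for a pretrained image encoder, and a small transformer decoder attends over its features to predict caption tokens. Every name and dimension is an assumption made for illustration; none is taken from the cited models.

import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Encoder: a small conv stack standing in for a pretrained CNN/ViT.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                     # (B, d, N)
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        memory = self.encoder(images).transpose(1, 2)  # (B, N, d)
        tgt = self.embed(tokens)                       # (B, T, d)
        out = self.decoder(tgt, memory)                # cross-attend to image
        return self.head(out)                          # (B, T, vocab)

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

In this layout the encoder dominates the parameter count once a realistic pretrained backbone is substituted, which is why the paper argues that compressing only the decoder leaves most of the model size untouched.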