Abstract

Images and text are often used together to convey a message. Mapping images to text can provide very useful information and has applications in many areas, such as the medical domain, assistive tools for blind people, and social networking. In this paper, we investigate an approach for mapping images to text using a Kernel Ridge Regression model. We considered two types of features: simple RGB pixel-value features and image features extracted with deep-learning approaches. We investigated several neural network architectures for image feature extraction: VGG16, Inception V3, ResNet50, and Xception. The experimental evaluation was performed on three data sets from different domains. The texts associated with the images are objective descriptions for two of the three data sets and subjective descriptions for the third. The experimental results show that the more complex deep-learning approaches used for feature extraction perform better than the simple RGB pixel-value approach. Moreover, the ResNet50 architecture performs best among the four deep network architectures considered for extracting image features: the model error obtained with ResNet50 is approximately 0.30 lower than with the other architectures. We extracted natural-language descriptors of images and compared the original and generated descriptive words. Furthermore, we investigated whether performance differs with the type of text associated with the images: subjective or objective. The proposed model generated descriptions more similar to the original ones for the data set containing objective descriptions, whose vocabulary is simpler, larger, and clearer.
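The pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' exact implementation: it assumes TensorFlow/Keras for the pretrained ResNet50 feature extractor, scikit-learn's KernelRidge for the regression model, and fixed-length bag-of-words vectors as the text-side targets; train_image_paths, test_image_paths, and Y_train_text are hypothetical placeholders.

```python
# Minimal sketch of the pipeline described in the abstract, not the authors'
# exact implementation. Assumptions: TensorFlow/Keras provides the pretrained
# ResNet50 feature extractor, scikit-learn provides Kernel Ridge Regression,
# and texts are encoded as fixed-length bag-of-words vectors.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.kernel_ridge import KernelRidge

# Pretrained ResNet50 as a fixed feature extractor (2048-d average-pooled features).
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def image_features(paths):
    """Load images, apply ResNet50 preprocessing, and return deep feature vectors."""
    batch = np.stack([img_to_array(load_img(p, target_size=(224, 224))) for p in paths])
    return extractor.predict(preprocess_input(batch))

# train_image_paths, test_image_paths, and Y_train_text are hypothetical placeholders;
# Y_train_text holds one fixed-length text vector (e.g., word occurrences) per image.
X_train = image_features(train_image_paths)
model = KernelRidge(alpha=1.0, kernel="rbf")  # kernel and alpha are illustrative choices
model.fit(X_train, Y_train_text)

# Predicted text vectors; the top-scoring dimensions map back to descriptive words.
Y_pred = model.predict(image_features(test_image_paths))
```

At prediction time, the highest-scoring components of Y_pred can be mapped back to vocabulary words to obtain the generated descriptors that are compared against the original ones.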

Highlights

  • A quick look at an image is sufficient for a human to say a few words related to that image; this very easy task for humans is very difficult for existing computer vision systems, and the majority of previous work in computer vision [1,2,3,4] has focused on labeling images with a fixed set of visual categories

  • We investigated a method for mapping images to text in different real-world scenarios

  • To confirm the potential of deep-learning techniques for mapping images to text, we considered two types of features: simple RGB pixel-value features and image features extracted with deep-learning approaches (a sketch of the pixel-value baseline follows this list)
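For comparison with the deep features, here is a hedged sketch of the simple RGB pixel-value baseline mentioned in the last highlight: each image is resized to a fixed resolution and its raw pixels are flattened into one feature vector. The 32x32 resolution is an assumption for illustration, not taken from the paper.

```python
# Hedged sketch of the simple RGB pixel-value baseline: resize each image to a
# fixed resolution and flatten the raw pixels into one feature vector. The
# 32x32 resolution is an assumption for illustration, not taken from the paper.
import numpy as np
from PIL import Image

def rgb_pixel_features(paths, size=(32, 32)):
    """Return an (n_images, size[0]*size[1]*3) matrix of normalized pixel values."""
    feats = []
    for p in paths:
        img = Image.open(p).convert("RGB").resize(size)
        feats.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
    return np.stack(feats)
```

These vectors can be fed to the same Kernel Ridge Regression model in place of the deep features, which is the comparison reported in the abstract.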



Introduction

A quick look at an image is sufficient for a human to say a few words related to that image. This very easy task for humans is a very difficult task for existing computer vision systems. The majority of previous work in computer vision [1,2,3,4] has focused on labeling images with a fixed set of visual categories. Even though closed vocabularies of visual concepts are a convenient modeling assumption, they are quite restrictive when compared to the vast amount of rich descriptions and impressions that a human can compose. We want to take a step towards the goal of generating descriptions of images that are close to natural language.
