Microsoft COCO Dataset Research Articles

Synthesizing a complex scene image with multiple objects and background according to text description is a challenging problem. It needs to solve several difficult tasks across the fields of natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and objects’ status optimization. To reach a satisfactory result, we propose a comprehensive pipeline to convert the input text to its visual counterpart. The pipeline includes text processing, foreground objects and background scene retrieval, image synthesis using constrained MCMC, and post-processing. Firstly, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Secondly, we retrieve the required foreground objects from the foreground object dataset segmented from Microsoft COCO dataset, and retrieve an appropriate background scene image from the background image dataset extracted from the Internet. Thirdly, in order to ensure the rationality of foreground objects’ positions and sizes in the image synthesis step, we design a cost function and use the Markov Chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we further use Poisson-based and relighting-based methods to blend foreground objects and background scene image in the post-processing step. The synthesized results and comparison results based on Microsoft COCO dataset prove that our method outperforms some of the state-of-the-art methods based on generative adversarial networks (GANs) in visual quality of generated scene images.
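The abstract describes the layout step only in outline: a cost function over the foreground objects, minimized with MCMC. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a toy cost (pairwise overlap area plus area falling outside the canvas), a Gaussian random-walk proposal, and a Metropolis acceptance rule. The names `layout_cost` and `mcmc_layout` and the `temperature` and `step_size` parameters are hypothetical. Unlike the method in the abstract, the sketch perturbs positions only and leaves sizes fixed.

```python
import math
import random


def layout_cost(boxes, canvas_w, canvas_h):
    """Toy cost (an assumption): overlap area between boxes plus area outside the canvas."""
    cost = 0.0
    for i, (x, y, w, h) in enumerate(boxes):
        # Area of this box that falls outside the canvas.
        vis_w = max(0.0, min(x + w, canvas_w) - max(x, 0.0))
        vis_h = max(0.0, min(y + h, canvas_h) - max(y, 0.0))
        cost += w * h - vis_w * vis_h
        # Overlap area with every later box.
        for (x2, y2, w2, h2) in boxes[i + 1:]:
            ox = max(0.0, min(x + w, x2 + w2) - max(x, x2))
            oy = max(0.0, min(y + h, y2 + h2) - max(y, y2))
            cost += ox * oy
    return cost


def mcmc_layout(boxes, canvas_w, canvas_h, steps=5000,
                temperature=50.0, step_size=10.0, seed=0):
    """Metropolis sampler over box positions; returns the lowest-cost layout seen."""
    rng = random.Random(seed)
    current = [tuple(b) for b in boxes]
    current_cost = layout_cost(current, canvas_w, canvas_h)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Propose moving one randomly chosen box by a Gaussian offset.
        i = rng.randrange(len(current))
        x, y, w, h = current[i]
        proposal = list(current)
        proposal[i] = (x + rng.gauss(0, step_size), y + rng.gauss(0, step_size), w, h)
        new_cost = layout_cost(proposal, canvas_w, canvas_h)
        # Always accept improvements; accept worse layouts with
        # probability exp(-(increase in cost) / temperature).
        accept_prob = math.exp(min(0.0, (current_cost - new_cost) / temperature))
        if rng.random() < accept_prob:
            current, current_cost = proposal, new_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


if __name__ == "__main__":
    # Three boxes (x, y, width, height) that start stacked on top of each other.
    start = [(100.0, 100.0, 80.0, 60.0)] * 3
    layout, cost = mcmc_layout(start, canvas_w=400.0, canvas_h=300.0)
    print(cost, layout)
```

Occasionally accepting worse layouts is what lets the sampler escape local minima, for example two objects that can only be separated by first moving one of them across a third.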

Automatically generating a natural language description of an image is one of the most fundamental and challenging problems in Multimedia Intelligence because it translates information between two different modalities, and such translation requires the ability to understand both modalities. Existing image captioning models have already achieved remarkable performance. However, they rely heavily on the Encoder-Decoder framework, a one-directional translation that is hard to improve further. In this paper, we design the “Tell and Guess” Cooperative Learning model with a Hierarchical Refined Attention mechanism (CL-HRA), which bidirectionally improves performance to generate more informative captions. The Cooperative Learning (CL) method combines an image caption module (ICM) with an image retrieval module (IRM). The ICM is responsible for the “Tell” function, generating informative natural language descriptions for a given image, while the IRM performs the “Guess” function, trying to select that image from a lineup of images based on the given description. Such cooperation mutually improves the learning of the two modules. In addition, the Hierarchical Refined Attention (HRA) learns to selectively attend to the high-level attributes and the low-level visual features, then incorporates them into CL to close the objective gap between image and caption. The HRA can pay different amounts of attention at different semantic levels to refine the visual representation, while the CL, with its human-like mindset, is more interpretable and generates a more relevant caption for the corresponding image. The experimental results on the Microsoft COCO dataset show the effectiveness of CL-HRA in terms of several popular image caption generation metrics.

Articles published on Microsoft COCO Dataset

Boost image captioning with knowledge reasoning

Design of an ergonomic App for entire rapid body assessment based on Mask RCNN

Transformer with sparse self‐attention mechanism for image captioning

Object Recognition with Hybrid Deep Learning Methods and Testing on Embedded Systems

Fast Vehicle and Pedestrian Detection Using Improved Mask R-CNN

A Comprehensive Pipeline for Complex Text-to-Image Synthesis

Image-Text Joint Learning for Social Images with Spatial Relation Model

Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention

Deep Regionlets: Blended Representation and Deep Learning for Generic Object Detection

Solder Joint Recognition Using Mask R-CNN Method

Boosted Transformer for Image Captioning

Learning Transparent Object Matting

Dynamic length colour palettes

Matching Image and Sentence with Multi-faceted Representations

Evaluation of deep neural networks for traffic sign detection systems

Social Image Captioning: Exploring Visual Attention and User Attention

Deep feature representation based on privileged knowledge transfer

The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses
