Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval
Composing Text and Image to Image Retrieval (CTI-IR) is an emerging task in computer vision that retrieves images relevant to a query image together with text describing desired modifications to that image. Conventional cross-modal retrieval approaches typically take data of one modality as the query to retrieve relevant data of another modality. In contrast, in this article we propose an end-to-end trainable network for simultaneous image generation and CTI-IR. The proposed model is based on the Generative Adversarial Network (GAN) and enjoys several merits. First, it learns a generative and discriminative feature for the query (a query image with a text description) by jointly training a generative model and a retrieval model. Second, it automatically manipulates the visual features of the reference image according to the text description through adversarial learning between the synthesized image and the target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to focus on both the global and local differences between the query image and the target image. As a result, the semantic consistency and fine-grained details of the generated images are better enhanced. The generated image can also be used to interpret and empower our retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
- Research Article
21
- 10.1016/j.jvcir.2014.01.004
- Jan 17, 2014
- Journal of Visual Communication and Image Representation
Statistical distributional approach for scale and rotation invariant color image retrieval using multivariate parametric tests and orthogonality condition
- Research Article
226
- 10.1186/s40537-021-00414-0
- Jan 1, 2021
- Journal of Big Data
Any computer vision application starts by acquiring images and data, followed by preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced and inadequate, the desired task may not be achievable. Unfortunately, imbalance problems in acquired image datasets are inevitable in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, and disaster prediction. The performance of computer vision algorithms can deteriorate significantly when the training dataset is imbalanced. In recent years, Generative Adversarial Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. GANs can not only generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments in GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of GAN-based synthetic image generation are extensively covered in this survey. We first introduce various imbalance problems in computer vision tasks and their existing solutions, and then examine key concepts such as deep generative image models and GANs. We then propose a taxonomy that summarizes GAN-based techniques for addressing imbalance problems into three major categories: (1) image-level imbalances in classification, (2) object-level imbalances in object detection, and (3) pixel-level imbalances in segmentation tasks. We elaborate on the imbalance problems in each category and present GAN-based solutions for each.
Readers will understand how GAN-based techniques can handle imbalance problems and boost the performance of computer vision algorithms.
- Conference Article
- 10.5121/csit.2014.4912
- Sep 13, 2014
A novel method for color image retrieval based on statistical non-parametric tests, namely the two-sample Wald test for equality of variance and the Mann-Whitney U test, is proposed in this paper. The proposed method tests the deviation, i.e., the distance in terms of variance, between the query and target images; if the images pass the test, the method proceeds to test the spectrum of energy, i.e., the distance between the mean values of the two images; otherwise, the test is dropped. If the query and target images pass the tests, it is inferred that the two images belong to the same class, i.e., both images are the same; otherwise, it is assumed that they belong to different classes. The obtained test statistic values are indexed in ascending order, and the image corresponding to the least value is identified as the same or a similar image. Here, either the query image or the target image is treated as the sample; the other is treated as the population. Additionally, features such as coefficient of variation, skewness, kurtosis, variance, and spectrum of energy are compared between the query and target images color-wise. The proposed method is robust to scaling and rotation, since it adjusts itself and treats either the query image or the target image as the sample of the other. The results obtained are comparable with existing methods. Keywords—variance, mean, query image, target image, non-parametric tests.
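The two-stage screening described above can be sketched in a few lines. For brevity, this illustration replaces the exact Wald and Mann-Whitney statistics with plain variance and mean comparisons; the function name, tolerances, and data layout (images as flat lists of pixel intensities) are illustrative assumptions, not details from the paper.

```python
import statistics

def screen_and_rank(query, database, var_tol=0.25, mean_tol=10.0):
    """Two-stage screening: drop images whose variance differs from the
    query's by more than var_tol (relative); rank survivors by the
    absolute difference of means, ascending (least value = most similar)."""
    q_var = statistics.pvariance(query)
    q_mean = statistics.fmean(query)
    ranked = []
    for name, pixels in database.items():
        # Stage 1: deviation in terms of variance
        if abs(statistics.pvariance(pixels) - q_var) / max(q_var, 1e-9) > var_tol:
            continue  # test dropped
        # Stage 2: distance between the mean values (spectrum of energy)
        stat = abs(statistics.fmean(pixels) - q_mean)
        if stat <= mean_tol:
            ranked.append((stat, name))
    ranked.sort()
    return [name for _, name in ranked]
```

For example, a query `[10, 20, 30, 40]` against a database containing the same image, a brightness-shifted copy, and a very different image ranks the identical image first and drops the different one at the variance stage.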
- Research Article
34
- 10.1007/s11263-020-01411-1
- Jan 5, 2021
- International Journal of Computer Vision
Scene text recognition is an important task in computer vision. Despite the tremendous progress achieved in the past few years, issues such as varying font styles, arbitrary shapes, and complex backgrounds have made the problem very challenging. In this work, we propose to improve text recognition from a new perspective by separating the text content from complex backgrounds, thus making recognition considerably easier and significantly improving recognition accuracy. To this end, we exploit generative adversarial networks (GANs) for removing backgrounds while retaining the text content. As vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images, we propose an adversarial learning framework for the generation and recognition of multiple characters in an image. The proposed framework consists of an attention-based recognizer and a generative adversarial architecture. Furthermore, to tackle the issue of lacking paired training samples, we design an interactive joint training scheme, which shares attention masks from the recognizer with the discriminator and enables the discriminator to extract the features of each character for further adversarial training. Benefiting from the character-level adversarial training, our framework requires only unpaired simple data for style supervision. Each target style sample, containing only one randomly chosen character, can be simply synthesized online during training. This is significant, as the training does not require costly paired samples or character-level annotations; only the input images and corresponding text labels are needed. In addition to the style normalization of the backgrounds, we refine character patterns to ease the recognition task. A feedback mechanism is proposed to bridge the gap between the discriminator and the recognizer.
Therefore, the discriminator can guide the generator according to the confusion of the recognizer, so that the generated patterns are clearer for recognition. Experiments on various benchmarks, including both regular and irregular text, demonstrate that our method significantly reduces the difficulty of recognition. Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.
- Research Article
58
- 10.1016/j.patcog.2020.107440
- Jun 5, 2020
- Pattern Recognition
Generative attention adversarial classification network for unsupervised domain adaptation
- Research Article
56
- 10.1007/s11263-020-01321-2
- Mar 21, 2020
- International journal of computer vision
Generative adversarial networks (GANs) are widely used in medical image analysis tasks such as medical image segmentation and synthesis. In these works, adversarial learning is directly applied to the original supervised segmentation (synthesis) networks. Adversarial learning is effective in improving visual perception performance, since it acts as a realism regularization for supervised generators. However, the quantitative performance often does not improve as much as the qualitative performance, and it can even worsen in some cases. In this paper, we explore how to take better advantage of adversarial learning in supervised segmentation (synthesis) models and propose an adversarial confidence learning framework to better model these problems. We analyze the roles of the discriminator in classic GANs and compare them with those in supervised adversarial systems. Based on this analysis, we propose adversarial confidence learning: besides the adversarial learning that emphasizes visual perception, we use the confidence information provided by the adversarial network to enhance the design of the supervised segmentation (synthesis) network. In particular, we propose a fully convolutional adversarial network for confidence learning that provides voxel-wise and region-wise confidence information to the segmentation (synthesis) network. With these settings, we propose a difficulty-aware attention mechanism to properly handle hard samples or regions by taking structural information into consideration, so that we can better deal with the irregular distribution of medical data. Furthermore, we investigate the loss functions of various GANs and propose using the binary cross-entropy loss to train the proposed adversarial system, so that we can retain the unlimited modeling capacity of the discriminator.
Experimental results on clinical and challenge datasets show that our proposed network achieves state-of-the-art segmentation (synthesis) accuracy. Further analysis also indicates that adversarial confidence learning improves both the visual perception performance and the quantitative performance.
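The idea of re-weighting a supervised loss with discriminator confidence can be illustrated with a minimal per-pixel sketch. The flat pixel lists, the simple `1 - confidence` weighting rule, and the function name are assumptions made for illustration; they are not the paper's actual implementation.

```python
import math

def confidence_weighted_ce(probs, labels, confidence):
    """Per-pixel binary cross-entropy, re-weighted so that pixels the
    adversarial network judges as unrealistic (low confidence) receive
    a larger weight, in the spirit of difficulty-aware attention."""
    total, norm = 0.0, 0.0
    for p, y, c in zip(probs, labels, confidence):
        w = 1.0 - c  # harder (low-confidence) pixel -> larger weight
        ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * ce
        norm += w
    return total / max(norm, 1e-9)
```

In a real network this weighting would be applied inside the training loop, with `confidence` produced by the fully convolutional adversarial network.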
- Conference Article
5
- 10.1109/etfa45728.2021.9613282
- Sep 7, 2021
Joining-element and assembly design remains largely a manual process, which increases the risk of costlier and longer development trajectories. Current automation solutions do not consider historical data, and traditional machine learning approaches have limitations. Meanwhile, generative adversarial networks have become benchmark methodologies for generation tasks in computer vision. Products in the manufacturing industry may contain thousands of spot welds, so design automation enables engineers to focus on their core competencies. This work presents a methodology to predict spot weld locations using generative adversarial networks. A 2D-based approach implements a variant of StarGAN_v2 to predict locations. It uses domain-based prediction concepts that integrate clustering of geometrical and product manufacturing information, as well as reference-guided style generation. Results indicate that generative adversarial networks can predict spot weld positions based on 2D image data.
- Conference Article
589
- 10.1109/iccv.2017.606
- Oct 1, 2017
Semantic segmentation has been a long-standing challenging task in computer vision. It aims at assigning a label to each image pixel and needs a significant amount of pixel-level annotated data, which is often unavailable. To address this lack of annotations, in this paper we leverage, on one hand, a massive amount of available unlabeled or weakly labeled data, and on the other hand, non-real images created through Generative Adversarial Networks. In particular, we propose a semi-supervised framework, based on Generative Adversarial Networks (GANs), which consists of a generator network that provides extra training examples to a multi-class classifier, acting as the discriminator in the GAN framework, which assigns each sample a label y from the K possible classes or marks it as a fake sample (an extra class). The underlying idea is that adding a large amount of fake visual data forces real samples to be close in the feature space, which, in turn, improves multi-class pixel classification. To ensure a higher quality of the images generated by the GANs, with consequently improved pixel classification, we extend the above framework by adding weakly annotated data, i.e., we provide class-level information to the generator. We test our approaches on several challenging benchmark visual datasets, i.e., PASCAL, SiftFlow, Stanford, and CamVid, achieving competitive performance compared to state-of-the-art semantic segmentation methods.
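The (K+1)-way discriminator head described above, with K real classes plus one fake class, can be sketched as a simple post-processing of the logits: softmax over K+1 outputs, then renormalize the first K entries to get the class posterior conditioned on the sample being real. The function name and logit layout are illustrative assumptions.

```python
import math

def real_vs_fake_split(logits):
    """Split a (K+1)-way softmax into (posterior over K real classes,
    probability of being fake).  The last logit is the fake class."""
    m = max(logits)                              # for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    p_fake = probs[-1]
    real_posterior = [p / (1.0 - p_fake) for p in probs[:-1]]
    return real_posterior, p_fake
```

With uniform logits over K = 2 real classes and the fake class, each class gets probability 1/3, so the conditional posterior over real classes is [0.5, 0.5] and the fake probability is 1/3.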
- Research Article
3
- 10.33140/amlai.05.02.11
- Jun 25, 2024
- Advances in Machine Learning & Artificial Intelligence
Object detection is a basic task in computer vision, with numerous applications ranging from surveillance and autonomous driving to medical imaging and augmented reality. Recently, machine and deep learning approaches have significantly advanced the state of the art in object detection, enabling remarkable progress in accuracy, robustness, and efficiency. This paper presents a detailed review of recent research and developments in computer vision, object detection, and sensing techniques. We discuss key concepts, methodologies, and challenges in object detection, focusing on deep learning-based approaches. Additionally, we explore emerging trends such as instance segmentation, few-shot learning, and privacy-preserving techniques in object detection. Furthermore, we discuss benchmark datasets, evaluation metrics, and open research challenges in the field. In view of current research and techniques, this review aims to guide researchers and enthusiasts toward understanding the latest advancements and future directions in this exciting area of computer vision. We discuss key topics such as image classification, object detection, image segmentation, and scene understanding. The rapid progress in deep learning has revolutionized computer vision, enabling models to learn hierarchical representations directly from data. We review prominent deep learning architectures such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), and their applications in various computer vision tasks. Furthermore, we explore recent developments in multi-modal and cross-modal learning, domain adaptation, and interpretability in computer vision models. Additionally, we discuss challenges such as data bias, ethical considerations, and scalability issues faced by the field.
By providing a comprehensive overview, this paper aims to inspire further research and innovation in computer vision, advancing its capabilities and broadening its impact on society.
- Research Article
106
- 10.1109/tgrs.2020.3020804
- Sep 23, 2020
- IEEE Transactions on Geoscience and Remote Sensing
The accuracy of remote sensing image segmentation and classification is known to dramatically decrease when the source and target images are from different sources; while deep learning-based models have boosted performance, they are only effective when trained with a large number of labeled source images that are similar to the target images. In this article, we propose a generative adversarial network (GAN) based domain adaptation for land cover classification using new target remote sensing images that are enormously different from the labeled source images. In GANs, the source and target images are fully aligned in the image space, feature space, and output space domains in two stages via adversarial learning. The source images are translated to the style of the target images, which are then used to train a fully convolutional network (FCN) for semantic segmentation to classify the land cover types of the target images. The domain adaptation and segmentation are integrated to form an end-to-end framework. The experiments that we conducted on a multisource data set covering more than 3500 km² with 51,560 256×256 high-resolution satellite images in Wuhan city and a cross-city data set with 11,383 256×256 aerial images in Potsdam and Vaihingen demonstrated that our method exceeded the recent GAN-based domain adaptation methods by at least 6.1% and 4.9% in the mean intersection over union (mIoU) and overall accuracy (OA) indexes, respectively. We also proved that our GAN is a generic framework that can be implemented for other domain transfer methods to boost their performance.
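The mIoU and OA indexes used for evaluation above are standard segmentation metrics and can be computed from two flat label maps (one class index per pixel) as follows; this is a generic metric sketch, not code from the article.

```python
def miou_and_oa(pred, truth, num_classes):
    """Mean intersection-over-union and overall accuracy for two flat
    label maps of equal length, with class indices in [0, num_classes)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    correct = 0
    for p, t in zip(pred, truth):
        if p == t:
            inter[p] += 1   # pixel counted once in both sets
            union[p] += 1
            correct += 1
        else:
            union[p] += 1   # pixel counted in each class's union
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious), correct / len(pred)
```

For instance, `miou_and_oa([0, 0, 1, 1], [0, 1, 1, 1], 2)` gives per-class IoUs of 1/2 and 2/3, i.e. an mIoU of 7/12 and an OA of 0.75.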
- Research Article
- 10.32628/cseit124102119
- Apr 30, 2024
- International Journal of Scientific Research in Computer Science, Engineering and Information Technology
Image retrieval is a key task in computer vision, with wide-ranging applications across multiple domains. This paper proposes an image retrieval approach based on query image features, focusing on the widely used Scale-Invariant Feature Transform (SIFT) algorithm for feature extraction and distance calculation. The suggested method starts by extracting SIFT features from a set of images, building a database of keypoints and descriptors. By capturing the unique qualities of local image regions, these features enable reliable matching and retrieval. The security of private cloud data, which includes queries, the search tree, and outsourced images, is another major concern. First, a feature extraction method is used to obtain integrated image features composed of fundamental components such as colour and shape. In particular, because the proposed method uses a balanced index tree, it can achieve logarithmic search time. Second, the image and query features are encrypted using the secure inner product, and a mechanism for detecting duplicate image content is also included. Given a query image, SIFT feature extraction produces keypoints and descriptors corresponding to its visual characteristics. Next, a distance measure such as the Euclidean distance is used to compare the database image features with the query image features; this comparison measures the similarity between the query image and the database images. The retrieved images are then ranked by similarity to the query image based on the calculated distances: results with smaller distances are deemed more similar and are displayed first. The image retrieval system based on SIFT feature extraction and distance calculation thus provides a reliable and effective solution for locating visually related images in large databases.
It contributes to developments in multimedia retrieval, visual analytics, and image understanding by enabling applications such as content-based image search, image recommendation systems, and image clustering.
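The ranking step described above (Euclidean distance between descriptors, sorted ascending) can be sketched in a few lines. SIFT extraction itself is assumed to have been done elsewhere (e.g. with OpenCV's `cv2.SIFT_create`), so each image is represented here by a pre-extracted global descriptor vector; names are illustrative.

```python
import math

def rank_by_distance(query_desc, database):
    """Rank database images by Euclidean distance between their
    pre-extracted descriptors and the query descriptor; smaller
    distance means more similar, so those names come first."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(database, key=lambda name: dist(query_desc, database[name]))
```

For example, with a query descriptor `[0, 0]` and database descriptors `a=[0, 0]`, `b=[3, 4]`, `c=[1, 1]`, the ranking is `["a", "c", "b"]`.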
- Conference Article
647
- 10.1109/cvpr.2019.00160
- Jun 1, 2019
Generating an image from a given text description has two goals: visual realism and semantic consistency. Although significant progress has been made in generating high-quality and visually realistic images using generative adversarial networks, guaranteeing semantic consistency between the text description and visual content remains very challenging. In this paper, we address this problem by proposing a novel global-local attentive and semantic-preserving text-to-image-to-text framework called MirrorGAN. MirrorGAN exploits the idea of learning text-to-image generation by redescription and consists of three modules: a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM). STEM generates word- and sentence-level embeddings. GLAM has a cascaded architecture for generating target images from coarse to fine scales, leveraging both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM seeks to regenerate the text description from the generated image, which semantically aligns with the given text description. Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods.
- Research Article
16
- 10.1007/s10489-022-03378-7
- Apr 1, 2022
- Applied Intelligence
Classification of Bitcoin entities is an important task to help Law Enforcement Agencies reduce anonymity in the Bitcoin blockchain network and to detect classes more tied to illegal activities. However, this task is strongly conditioned by a severe class imbalance in Bitcoin datasets. Existing approaches for addressing the class imbalance problem can be improved considering generative adversarial networks (GANs) that can boost data diversity. However, GANs are mainly applied in computer vision and natural language processing tasks, but not in Bitcoin entity behaviour classification where they may be useful for learning and generating synthetic behaviours. Therefore, in this work, we present a novel approach to address the class imbalance in Bitcoin entity classification by applying GANs. In particular, three GAN architectures were implemented and compared in order to find the most suitable architecture for generating Bitcoin entity behaviours. More specifically, GANs were used to address the Bitcoin imbalance problem by generating synthetic data of the less represented classes before training the final entity classifier. The results were used to evaluate the capabilities of the different GAN architectures in terms of training time, performance, repeatability, and computational costs. Finally, the results achieved by the proposed GAN-based resampling were compared with those obtained using five well-known data-level preprocessing techniques. Models trained with data resampled with our GAN-based approach achieved the highest accuracy improvements and were among the best in terms of precision, recall and f1-score. Together with Random Oversampling (ROS), GANs proved to be strong contenders in addressing Bitcoin class imbalance and consequently in reducing Bitcoin entity anonymity (overall and per-class classification performance). 
To the best of our knowledge, this is the first work to explore the advantages and limitations of GANs in generating specific Bitcoin data and “attacking” Bitcoin anonymity. The proposed methods ultimately demonstrate that in Bitcoin applications, GANs are indeed able to learn the data distribution and generate new samples starting from a very limited class representation, which leads to better detection of classes related to illegal activities.
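The resampling strategy described above (generating synthetic samples of under-represented classes before training the final classifier) can be sketched as follows. The GAN generator is deliberately stubbed out behind a `sample_fn` callback, since training an actual GAN is beyond a short example; the function names are illustrative, not from the paper.

```python
from collections import Counter

def gan_resample(X, y, sample_fn):
    """Top up every minority class to the majority-class count using
    samples drawn from a trained generator.  `sample_fn(cls, n)` stands
    in for the GAN generator and must return n synthetic samples of
    class cls."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        for x_new in sample_fn(cls, target - n):
            X_out.append(x_new)
            y_out.append(cls)
    return X_out, y_out
```

After resampling, every class has the same number of samples, and the balanced set is then passed to the entity classifier for training.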
- Research Article
- 10.21917/ijivp.2011.0024
- Feb 1, 2011
- ICTACT Journal on Image and Video Processing
This paper proposes a simple but efficient scheme for colour image retrieval based on statistical tests of hypothesis, namely the test for equality of variance and the test for equality of mean. The test for equality of variance is performed to test the similarity of the query and target images. If the images pass the test, the test for equality of mean is performed on the same images to examine whether the two images have the same attributes/characteristics. If the query and target images pass the tests, it is inferred that the two images belong to the same class, i.e., both images are the same; otherwise, it is assumed that they belong to different classes, i.e., the images are different. The obtained test statistic values are indexed in ascending order, and the image corresponding to the least value is identified as the same/similar image. The proposed system is invariant to translation, scaling, and rotation, since it adjusts itself and treats either the query image or the target image as the sample of the other. The proposed scheme provides 100% accuracy if the query and target images are the same, with only a slight variation for similar or transformed images.
- Research Article
9
- 10.1016/j.inffus.2024.102632
- Aug 14, 2024
- Information Fusion
Exploring adversarial deep learning for fusion in multi-color channel skin detection applications