Global and Local Semantic Completion Learning for Vision-Language Pre-Training.
Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) in NLP pre-training, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to reconstruct the masked tokens from the visible context in order to learn local-local alignment, i.e., associations between image patches and text tokens. However, most of them pay little attention to the global semantic features generated for the masked data, limiting the ability of global representations to align with local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features through cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on downstream task performance, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder that enables our model to perform both image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
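Masked modeling tasks of the kind described above start by hiding a fraction of the input tokens or patches and training the model to reconstruct them from the visible context. A minimal sketch of that masking step (the `mask_tokens` helper, the ratio, and the mask id are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.15, mask_id=0, seed=0):
    """Replace a random fraction of token ids with a [MASK] id.

    Returns the masked sequence and the boolean positions the model
    would be trained to reconstruct from the visible context.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(len(tokens)) < mask_ratio
    masked = np.where(mask, mask_id, tokens)
    return masked, mask
```

The same pattern applies to image patches: mask a subset, then ask the fusion model to complete the missing local (and, in GLSCL, global) semantics.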
- Conference Article
4
- 10.1109/icaice54393.2021.00047
- Nov 1, 2021
In recent years, great improvements have been achieved in cross-modal person re-identification (Re-ID) methods based on feature partition. However, many works do not use global and local features jointly to improve the accuracy of person identification. Fully extracting and exploiting global as well as local features, while effectively reducing modality differences, remains an important research topic. In this paper, we propose an adversarial learning method based on global and local features (ALGL). We adopt a two-stream network with partially shared parameters as the feature extraction network to extract visible and infrared feature maps. Local features are obtained through Part-based Convolutional Baseline (PCB) operations on the feature maps in the local feature learning module. In the global feature learning module, average pooling is used to obtain the global features. To fully explore the discriminative abilities of local and global features, a hetero-center based triplet loss is designed, which pulls features of the same identity closer together and pushes features of different identities farther apart. At the same time, an adversarial learning module minimizes the modality difference between the visible and infrared modalities. Experimental results on the SYSU-MM01 and RegDB datasets show that ALGL outperforms state-of-the-art solutions.
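As a rough illustration of the hetero-center triplet idea (per-identity feature centers in each modality, pulled together across modalities and pushed apart across identities), here is a minimal NumPy sketch; the function name, margin value, and hardest-negative choice are assumptions, not the paper's exact formulation:

```python
import numpy as np

def hetero_center_triplet_loss(feats_v, labels_v, feats_i, labels_i, margin=0.3):
    """Triplet loss over per-identity feature centers of two modalities.

    Positive: the same identity's center in the other modality.
    Negative: the closest other-identity center in the other modality.
    """
    ids = np.unique(labels_v)
    cv = np.stack([feats_v[labels_v == k].mean(axis=0) for k in ids])  # visible centers
    ci = np.stack([feats_i[labels_i == k].mean(axis=0) for k in ids])  # infrared centers
    loss = 0.0
    for a in range(len(ids)):
        pos = np.linalg.norm(cv[a] - ci[a])
        neg = min(np.linalg.norm(cv[a] - ci[n])
                  for n in range(len(ids)) if n != a)
        loss += max(0.0, margin + pos - neg)  # hinge on hardest negative
    return loss / len(ids)
```

Working on centers rather than individual samples reduces the number of triplets and makes the cross-modality positive unambiguous.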
- Research Article
24
- 10.1109/tip.2020.2965306
- Jan 1, 2020
- IEEE Transactions on Image Processing
Practically, it is more feasible to collect compact visual features, rather than raw video streams, from hundreds of thousands of cameras into the cloud for big data analysis and retrieval. The problem then becomes which kinds of features should be extracted, compressed, and transmitted to meet the requirements of various visual tasks. Recently, many studies have indicated that the activations from the convolutional layers of convolutional neural networks (CNNs) can be treated as local deep features describing particular details inside an image region, which are then aggregated (e.g., using Fisher Vectors) into a powerful global descriptor. The combination of local and global features can satisfy these various needs effectively. It has also been validated that, if only local deep features are coded and transmitted to the cloud while the global features are recovered from the decoded local features, the aggregated global features will be lossy and will consequently degrade overall performance. Therefore, this paper proposes a joint coding framework for local and global deep features (DFJC) extracted from videos. In this framework, we introduce a coding scheme for real-valued local and global deep features with intra-frame lossy coding and inter-frame reference coding. A theoretical analysis is performed to understand how the number of inliers varies with the number of local features. Moreover, inter-feature correlations are exploited in our framework: local feature coding can be accelerated by making use of the frame types determined from global features, while the lossy global features aggregated from the decoded local features can be used as a reference for global feature coding. Extensive experimental results under three metrics show that our DFJC framework can significantly reduce the bitrate of local and global deep features from videos while maintaining retrieval performance.
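The "lossy" part of intra-frame feature coding can be pictured as plain uniform scalar quantization; the following toy sketch (step size and function name are my own, not the DFJC scheme) shows why decoded features differ from the originals:

```python
import numpy as np

def quantize_dequantize(feat, step=0.5):
    """Uniform scalar quantization of a real-valued feature vector.

    q holds the integers one would entropy-code and transmit;
    the reconstruction q * step is the lossy decoded feature.
    """
    q = np.round(feat / step).astype(int)
    return q, q * step
```

Any global descriptor aggregated from such reconstructions inherits this quantization error, which is the degradation the paper's reference-coding design works around.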
- Conference Article
176
- 10.1109/cvpr52688.2022.01522
- Jun 1, 2022
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common downstream vision-language tasks such as image-text retrieval and visual question answering.
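For reference, the InfoNCE image-to-text alignment objective mentioned above can be sketched in a few lines of NumPy; the temperature value and the omission of the symmetric text-to-image direction are simplifications of typical implementations, not TCL's exact loss:

```python
import numpy as np

def info_nce_i2t(img_emb, txt_emb, temperature=0.07):
    """Image-to-text InfoNCE: the i-th image's positive is the i-th text."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()         # cross-entropy on diagonal
```

Minimizing this loss maximizes a lower bound on the mutual information between matched image and text embeddings, which is the MI argument the abstract invokes.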
- Conference Article
10
- 10.1109/icme.2019.00125
- Jul 1, 2019
Deep part-based methods in recent literature have revealed the great potential of learning local part-level representations of pedestrian images for the task of person re-identification. However, global features that capture discriminative holistic information about the human body are usually ignored or not well exploited. This motivates us to investigate jointly learning global and local features from pedestrian images. Specifically, in this work, we propose a novel framework termed the tree branch network (TBN) for person re-identification. Given a pedestrian image, the feature maps generated by the backbone CNN are partitioned recursively into several pieces, each of which is followed by a bottleneck structure that learns finer-grained features for each level in the hierarchical tree-like framework. In this way, representations are learned in a coarse-to-fine manner and finally assembled to produce more discriminative image descriptions. Experimental results demonstrate the effectiveness of the global and local feature learning method in the proposed TBN framework. We also show significant improvements in performance over state-of-the-art methods on three public benchmarks: Market-1501, CUHK-03, and DukeMTMC.
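The recursive partitioning at the heart of TBN can be pictured with a toy helper that splits a feature map along its height level by level; the depth and the split-in-two choice are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def tree_partition(fmap, depth):
    """Recursively halve a (H, W) feature map along its height.

    Returns one list of parts per tree level, coarse to fine;
    in a TBN-style model each part would feed its own bottleneck.
    """
    levels = [[fmap]]
    for _ in range(depth):
        finer = []
        for part in levels[-1]:
            h = part.shape[0] // 2
            finer += [part[:h], part[h:]]
        levels.append(finer)
    return levels
```

The coarse levels retain holistic (global) information while the fine leaves capture part-level detail, matching the coarse-to-fine assembly the abstract describes.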
- Research Article
18
- 10.1016/j.eswa.2017.07.018
- Jul 13, 2017
- Expert Systems with Applications
Face alignment using a deep neural network with local feature learning and recurrent regression
- Peer Review Report
1
- 10.7554/elife.78635.sa2
- Oct 10, 2022
Author response: A connectomics-based taxonomy of mammals
- Research Article
15
- 10.1049/cit2.12001
- Mar 1, 2021
- CAAI Transactions on Intelligence Technology
In recent years, vehicle re-identification has attracted more and more attention. How to learn discriminative information from multi-view vehicle images is one of the challenging problems in the vehicle re-identification field: for example, when the viewpoint of an image changes, features extracted from one image may be absent from another. To address this issue, a two-branch network with pyramid-based local and spatial-attention global feature learning (PSA) is proposed for vehicle re-identification. Specifically, one branch learns local features at different scales by building a pyramid from coarse to fine, and the other branch learns attentive global features using a spatial attention module. Subsequently, global maximum pooling (GMP) is applied to the local features and global average pooling (GAP) to the global feature. Finally, the local feature vectors and the global feature vector extracted from the last pooling layer are employed for identity re-identification. The experimental results demonstrate that the proposed method achieves state-of-the-art results on the VeRi-776 and VehicleID datasets.
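The two pooling choices described for PSA (max pooling for the local pyramid features, average pooling for the attention-refined global map) each reduce a (C, H, W) map to a C-dimensional vector. A minimal sketch, with channel-first shapes assumed:

```python
import numpy as np

def psa_pool(local_maps, global_map):
    """GMP over each local pyramid map, GAP over the global map.

    local_maps: list of (C, H, W) arrays; global_map: (C, H, W).
    Returns per-scale local vectors and one global vector, each of length C.
    """
    local_vecs = [m.max(axis=(1, 2)) for m in local_maps]   # global max pooling
    global_vec = global_map.mean(axis=(1, 2))               # global average pooling
    return local_vecs, global_vec
```

Max pooling keeps the strongest local response per channel (good for distinctive parts), while averaging summarizes the whole attended map, which is the usual rationale for this pairing.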
- Research Article
16
- 10.1016/j.engappai.2024.108248
- Mar 15, 2024
- Engineering Applications of Artificial Intelligence
Global–local feature learning for fine-grained food classification based on Swin Transformer
- Research Article
37
- 10.1016/j.neunet.2021.02.005
- Feb 25, 2021
- Neural Networks
Dense Residual Network: Enhancing global dense feature flow for character recognition
- Research Article
25
- 10.1007/s11042-019-7674-5
- May 15, 2019
- Multimedia Tools and Applications
Human action recognition is a fundamental and challenging building block for many computer vision applications, including video surveillance, human-computer interaction, and multimedia retrieval systems. Various approaches have been proposed to solve the human action recognition problem; among them, moment-based methods are considered one of the simplest and most successful. However, moment-based methods take into consideration only global features while neglecting the discriminative properties of local features. In this paper, we propose a new efficient method that combines both Global and Local Zernike Moment (GLZM) features based on the Bag-of-Features (BoF) technique. Since global features alone are not sufficient to discriminate similar actions like running, walking, and jogging, augmenting them with localized features helps to improve recognition accuracy. The proposed method first calculates local temporal Motion Energy Images (MEI) by accumulating frame differences over short sequences of consecutive frames. Then, global and local features are calculated using Zernike moments with different polynomial orders to represent global and local motion patterns, respectively. Global features are calculated from the whole region of the human performing the action, while local features focus on localized regions of the human body to represent local motion information. Both local and global features are preprocessed using a whitening transformation; the bag-of-features algorithm is then employed to combine these pools of features and represent each action with a new GLZM feature descriptor. Finally, we use a multi-class Support Vector Machine (SVM) classifier to recognize human actions. To validate the proposed method, we perform a set of experiments on three publicly available datasets: Weizmann, KTH, and UCF Sports Action. Experimental results using a leave-one-out strategy show that the proposed method achieves promising results compared with other state-of-the-art methods.
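The Motion Energy Image step above (accumulating thresholded frame differences over a short window) can be sketched as follows; the threshold value and binary accumulation are common choices for MEI-style maps, not necessarily the paper's exact parameters:

```python
import numpy as np

def motion_energy_image(frames, threshold=0.1):
    """Binary map of every pixel that moved across a short frame window."""
    mei = np.zeros_like(frames[0], dtype=float)
    for prev, cur in zip(frames, frames[1:]):
        moved = (np.abs(cur - prev) > threshold).astype(float)
        mei = np.maximum(mei, moved)   # accumulate motion over time
    return mei
```

Zernike moments of different orders would then be computed over the whole MEI (global) and over localized subregions (local) to form the GLZM descriptor.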
- Research Article
60
- 10.1118/1.598011
- Jun 1, 1997
- Medical Physics
We investigated the application of multiresolution global and local texture features to reduce false-positive detections in a computerized mass detection program. One hundred and sixty-eight digitized mammograms were randomly and equally divided into training and test groups. From these mammograms, two datasets were formed. The first dataset (manual) contained four regions of interest (ROIs) selected manually from each of the mammograms. One of the four ROIs contained a biopsy-proven mass and the other three contained normal parenchyma, including dense, mixed dense/fatty, and fatty tissues. The second dataset (hybrid) contained the manually extracted mass ROIs, along with normal tissue ROIs extracted by an automated Density-Weighted Contrast Enhancement (DWCE) algorithm as false-positive detections. A wavelet transform was used to decompose an ROI into several scales. Global texture features were derived from the low-pass coefficients in the wavelet-transformed images. Local texture features were calculated from the suspicious object and the peripheral subregions. Linear discriminant models using effective features selected from the global, local, or combined feature spaces were established to maximize the separation between masses and normal tissue. Receiver Operating Characteristic (ROC) analysis was conducted to evaluate classifier performance. The classification accuracy using global features was comparable to that using local features. With both global and local features, the average area under the test ROC curve, Az, reached 0.92 for the manual dataset and 0.96 for the hybrid dataset, a statistically significant improvement over the areas obtained with global or local features alone. The results indicate the effectiveness of combined global and local features in classifying masses and normal tissue for false-positive reduction.
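A one-level 2-D Haar transform illustrates where the "low-pass coefficients" used for the global texture features come from; the paper's actual wavelet basis and number of scales may differ:

```python
import numpy as np

def haar_decompose(img):
    """One-level 2-D Haar transform of an even-sized grayscale image.

    Returns the low-pass band LL (the source of global texture features)
    and the horizontal, vertical, and diagonal detail bands LH, HL, HH.
    """
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0
    lh = (a + b - c - d) / 4.0
    hl = (a - b + c - d) / 4.0
    hh = (a - b - c + d) / 4.0
    return ll, lh, hl, hh
```

Applying the transform recursively to LL yields the multiresolution pyramid from which texture statistics at several scales can be computed.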
- Research Article
112
- 10.1016/j.jclepro.2014.06.056
- Jun 26, 2014
- Journal of Cleaner Production
The effect of local and global learning on the cost of renewable energy in developing countries
- Conference Article
107
- 10.1109/iccv48922.2021.01156
- Oct 1, 2021
Image retrieval is the fundamental task of obtaining images similar to a query image from a database. A common image retrieval practice is to first retrieve candidate images via similarity search using global image features and then re-rank the candidates by leveraging their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information inside images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It first attentively extracts representative local information with multi-atrous convolutions and self-attention. Components orthogonal to the global image representation are then extracted from the local information. Finally, the orthogonal components are concatenated with the global representation as a complement, and aggregation is performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets.
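The orthogonal-component step of DOLG (removing from each local feature its projection onto the global direction before concatenation) can be sketched as below; the mean-aggregation of the residuals is a simplification standing in for the paper's learned aggregation:

```python
import numpy as np

def orthogonal_fusion(local_feats, global_feat):
    """Concatenate the global feature with the locals' orthogonal components.

    local_feats: (N, D); global_feat: (D,). Returns a (2*D,) descriptor.
    """
    g = global_feat / np.linalg.norm(global_feat)
    coeffs = local_feats @ g                      # scalar projection per local
    orth = local_feats - coeffs[:, None] * g      # orthogonal residuals
    return np.concatenate([orth.mean(axis=0), global_feat])
```

Keeping only the orthogonal residuals ensures the local half of the descriptor adds information the global half does not already carry.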
- Research Article
- 10.11834/jig.211170
- Jan 1, 2023
- Journal of Image and Graphics
Objective: Fashion clothing matching has become an active topic in clothing-related fashion research. It requires learning the complex matching relationships (i.e., fashion compatibility) among the different fashion items in an outfit. Fashion items have rich partial designs and matching relationships among those designs. Most existing research analyzes global compatibility using items' global features (visual and textual), but local feature extraction for local compatibility is often ignored, which lowers the performance and accuracy of fashion style matching. Therefore, we develop a fashion style matching method based on global-local feature optimization. It extracts the local features of fashion images to represent local information, models the local compatibility of fashion items, and improves matching accuracy by incorporating both global and local compatibility. Method: First, we use two different convolutional neural networks (CNNs) to extract the global features of fashion items from the input fashion images and texts. To extract CNN-based local features of fashion images, a multi-branch local feature extraction network is designed. Each branch consists of 1) a convolution layer, 2) a batch normalization (BN) layer, and 3) a rectified linear unit (ReLU) activation function; each branch extracts one local feature of the fashion image, so different branches capture different local features. Second, a global-local compatibility learning module is constructed from a graph neural network (GNN) and a self-attention mechanism (SAM), which models both global and local compatibility. The GNN models interactions among global features and among local features separately. SAM-based weights for the different fashion items are defined and integrated into the modeling, yielding each item's global and local compatibility. Finally, a fashion clothing matching optimization model is built to produce optimized matching results. It integrates all items' global compatibility and local compatibility within an outfit, and trade-off parameters adjust the impact of outfit-level global and local compatibility on fashion style matching. A matching score is computed for each candidate scheme, and the scheme with the highest score is output as the optimized matching result. Result: The proposed method is validated on the public Polyvore dataset, which includes fashion item images and textual descriptions. The local features extracted by our local feature extraction network represent the items' local information effectively, without attribute-label supervision. The global-local compatibility learning module learns an item's global and local compatibility simultaneously, with the weights of different items involved, modeling fashion compatibility completely. The fill-in-the-blank (FITB) accuracy of fashion style matching is improved to 86.89%. Conclusion: A fashion clothing matching method based on global-local feature optimization is developed. First, we construct a local feature extraction network to extract local features of fashion images alongside the items' global features. Next, after the global and local matching relationships among fashion items are analyzed with the graph network, the self-attention mechanism is introduced to weight the different fashion items, which models the global and local compatibilities of fashion items completely. Finally, our matching optimization model fuses each item's global and local compatibility within an outfit to obtain the outfit-level compatibilities, and parameters adjust the effect of the two kinds of compatibility on the matching result. The convergence speed of our method is still slow, and the optimization model combines the outfit's global and local compatibility only linearly; in practice, their relationship is more complex. Future work can focus on improving convergence speed and further optimizing clothing matching.
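The linear fusion the conclusion refers to (trade-off parameters balancing outfit-level global and local compatibility) reduces, in its simplest reading, to a convex combination of two scores; a toy sketch with an assumed weight alpha and hypothetical helper names:

```python
def outfit_score(global_comp, local_comp, alpha=0.6):
    """Linear trade-off between outfit-level global and local compatibility.

    alpha is an assumed weight on global compatibility; the candidate
    outfit with the highest fused score is chosen as the matching result.
    """
    return alpha * global_comp + (1 - alpha) * local_comp

def best_outfit(candidates, alpha=0.6):
    """candidates: list of (name, global_comp, local_comp) tuples."""
    return max(candidates, key=lambda c: outfit_score(c[1], c[2], alpha))[0]
```

The paper's own closing remark, that this linear combination understates the true interaction between the two compatibilities, is exactly the limitation such a sketch makes visible.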
- Book Chapter
4
- 10.1007/978-3-031-26284-5_13
- Jan 1, 2023
Image retrieval is the task of finding all images in the database that are similar to a query image. Two types of image representations have been studied to address this task: global and local image features. These features can be extracted separately or jointly in a single model. State-of-the-art methods usually learn them with Convolutional Neural Networks (CNNs) and perform retrieval with multi-scale image representations. This paper's main contribution is to unify global and local features with Vision Transformers (ViTs) and multi-atrous convolutions for high-performing retrieval. We refer to the new model as ViTGaL, standing for Vision Transformer based Global and Local features. Specifically, we add a multi-atrous convolution to the output of the transformer encoder layer of ViTs to simulate the image pyramid used in standard image retrieval algorithms. We use class attention to aggregate the token embeddings output from the multi-atrous layer to obtain both global and local features. The entire network can be learned end-to-end, requiring only image-level labels. Extensive experiments show the proposed method outperforms state-of-the-art methods on the Revisited Oxford and Paris datasets. Our code is available here.
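An atrous (dilated) convolution, the building block ViTGaL borrows to simulate an image pyramid, simply spaces the kernel taps out by a dilation factor. A 1-D toy version (the actual model uses 2-D multi-atrous convolutions over token maps):

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Valid-mode 1-D convolution with dilated (spaced-out) kernel taps.

    A larger dilation widens the receptive field without adding weights,
    which is how stacked atrous branches mimic multiple image scales.
    """
    span = (len(kernel) - 1) * dilation
    return [sum(k * x[i + j * dilation] for j, k in enumerate(kernel))
            for i in range(len(x) - span)]
```

Running several such branches with different dilations over the same feature map yields the multi-scale responses that a conventional image pyramid would otherwise provide.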