Abstract

Def2Vec introduces a new perspective on building word embeddings by using dictionary definitions. By leveraging term-document matrices derived from dictionary definitions and employing Latent Semantic Analysis (LSA), our method, Def2Vec, yields embeddings characterized by robust performance and adaptability. Through comprehensive evaluations encompassing token classification, sequence classification, and semantic similarity, we show empirically that Def2Vec is consistently competitive with established models like Word2Vec, GloVe, and FastText. Notably, by retaining all the matrices resulting from the LSA factorization, our model can efficiently predict embeddings for out-of-vocabulary words, given their definitions. By effectively integrating the benefits of dictionary definitions with LSA-based embeddings, Def2Vec builds informative semantic representations while minimizing data requirements. In this paper, we run several experiments to assess the quality of our embedding model at the word level and at the sequence level. Our findings contribute to the ongoing evolution of word embedding methodologies by incorporating structured lexical information and enabling efficient embedding prediction.
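To make the pipeline concrete, the following is a minimal sketch of the general approach the abstract describes: build a term-document matrix where each dictionary definition is one document, factorize it with truncated SVD (LSA), and fold a new definition into the latent space to predict an out-of-vocabulary embedding. This is not the authors' code; the toy definitions, the latent dimension k, and the helper names are illustrative assumptions.

```python
# Sketch of an LSA pipeline over dictionary definitions (illustrative, not the paper's implementation).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy dictionary: each definition acts as one document for the defined word.
definitions = {
    "cat":  "small domesticated carnivorous mammal with soft fur",
    "dog":  "domesticated carnivorous mammal with a long snout",
    "bank": "financial institution that accepts deposits and makes loans",
}

# Term-document matrix X: rows = terms, columns = definitions.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(definitions.values()).T.toarray().astype(float)

# Truncated SVD (LSA): X ~ U @ diag(s) @ Vt, keeping k latent dimensions.
k = 2  # assumed latent dimensionality for this toy example
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, Vt = U[:, :k], s[:k], Vt[:k, :]

# Word embeddings: each column of Vt is the latent vector of one definition,
# i.e. of the word that definition defines.
embeddings = {word: Vt[:, i] for i, word in enumerate(definitions)}

# Out-of-vocabulary prediction via the standard LSA fold-in:
# project a new definition d with d_hat = Sigma^{-1} U^T d.
def embed_oov(definition: str) -> np.ndarray:
    d = vectorizer.transform([definition]).toarray().ravel()
    return (U.T @ d) / s

print(embed_oov("large carnivorous mammal with thick fur"))
```

Because all three SVD factors are kept, the fold-in step is a cheap matrix-vector product, which is one plausible reading of how retaining the full factorization enables efficient out-of-vocabulary prediction.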