Abstract

Semantic code retrieval is the task of retrieving relevant code based on natural language queries. Although related to other information retrieval tasks, it must bridge the gap between the language used in code (which is usually syntax- and logic-specific) and natural language, which is better suited to describing ambiguous concepts and ideas. Existing approaches study natural-language code retrieval for one specific programming language; in multilingual scenarios this becomes unwieldy and often requires a large corpus for each language. This paper proposes MPLCS (Multi-Programming Language Code Search), a method that supports multi-programming-language code search by using knowledge distillation from six existing monolingual teacher models to train one student model. MPLCS incorporates multiple languages into a single model with low corpus requirements, can learn the commonalities between different programming languages, and improves retrieval accuracy for languages with small datasets: for the Ruby dataset used in this paper, MPLCS improved the MRR score by 20% to 25%. In addition, MPLCS compensates for the low retrieval accuracy of monolingual models when they retrieve code in other programming languages, and in some cases its retrieval accuracy even outperforms that of monolingual models retrieving code in their own language.
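The abstract does not spell out the distillation objective, so the following is only a minimal sketch of the common soft-label (Hinton-style) knowledge-distillation loss that such a teacher-student setup typically uses; the logits over candidate code snippets and the averaging over the six teachers are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KL loss: the student matches the teacher's softened
    relevance distribution over candidate code snippets.
    Scaled by T^2 as in the standard distillation formulation."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t) - np.log(s))).sum(axis=-1).mean()
    return float(kl * temperature ** 2)

def multi_teacher_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Hypothetical multi-teacher variant: average the loss against
    each monolingual teacher (six in the paper's setting)."""
    losses = [distillation_loss(student_logits, t, temperature)
              for t in teacher_logits_list]
    return sum(losses) / len(losses)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the distributions diverge, which is what drives the student toward the teachers' ranking behavior.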

Highlights

  • The research on code retrieval can be divided into two broad categories according to the methods used: Information Retrieval-Based Methods and Deep Learning Model-Based Methods

  • Information Retrieval-Based Methods build on traditional search techniques; the main idea is to match on text similarity and to perform code retrieval by combining search techniques with code features

  • We prepared a test set for each language and evaluated each monolingual model and Multi-Programming Language Code Search (MPLCS) on it; the Mean Reciprocal Rank (MRR) and SuccessRate@k results are shown in Tables 2 and 3, respectively
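The two evaluation metrics named in the highlights are standard and can be sketched directly; the 1-based `ranks` input (the position of the correct code snippet for each query) is an assumed representation, not the paper's code.

```python
def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank, where rank is the 1-based position of the
    correct code snippet in the retrieved list for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_rate_at_k(ranks, k):
    """SuccessRate@k: fraction of queries whose correct snippet
    appears within the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, ranks `[1, 2, 4]` give an MRR of (1 + 1/2 + 1/4) / 3 ≈ 0.583 and a SuccessRate@2 of 2/3.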


Summary

Introduction

The research on code retrieval can be divided into two broad categories according to the methods used: Information Retrieval-Based Methods and Deep Learning Model-Based Methods. Vector representation is a method for learning the connection between two heterogeneous structures: it maps data of the two different structures into the same high-dimensional space [20], so that corresponding data points fall as close to each other as possible while non-corresponding data points lie as far apart as possible. Such an approach makes the query process more intuitive: performing a search only requires finding the points in the high-dimensional space that are closest to the target point, i.e., the nearest-neighbor problem in high-dimensional space. For example, the code for bubble sort and the query "bubble sort" are mapped to nearby locations, as are those for "quick sort".
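The nearest-neighbor search described above can be sketched with cosine similarity over a shared embedding space; the toy 2-D vectors and the function name are illustrative assumptions, and a real system would use learned high-dimensional embeddings and an approximate index.

```python
import numpy as np

def cosine_nearest(query_vec, code_vecs, top_k=1):
    """Return the indices of the top_k code embeddings closest to the
    query embedding by cosine similarity (nearest-neighbor search in
    the shared query/code vector space)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity to each snippet
    return np.argsort(-sims)[:top_k]  # highest similarity first
```

With three toy snippet embeddings `[[1, 0], [0, 1], [0.9, 0.1]]` and the query embedding `[1, 0]`, the two nearest snippets are indices 0 and 2, matching the intuition that a "bubble sort" query lands near bubble-sort code.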


