Deeply fusing transformer model and information retrieval with cross attention for source code summarization

Similar Papers
  • Research Article
  • Cite Count: 13
  • 10.1145/3631975
Deep Is Better? An Empirical Comparison of Information Retrieval and Deep Learning Approaches to Code Summarization
  • Mar 15, 2024
  • ACM Transactions on Software Engineering and Methodology
  • Tingwei Zhu + 6 more

Code summarization aims to generate short functional descriptions for source code to facilitate code comprehension. While Information Retrieval (IR) approaches that leverage similar code snippets and corresponding summaries have led the early research, Deep Learning (DL) approaches that use neural models to capture statistical properties between code and summaries are now mainstream. Although some preliminary studies suggest that IR approaches are more effective in some cases, it is currently unclear how effective the existing approaches can be in general, where and why IR/DL approaches perform better, and whether the integration of IR and DL can achieve better performance. Consequently, there is an urgent need for a comprehensive study of the IR and DL code summarization approaches to provide guidance for future development in this area. This article presents the first large-scale empirical study of 18 IR, DL, and hybrid code summarization approaches on five benchmark datasets. We extensively compare different types of approaches using automatic metrics, we conduct quantitative and qualitative analyses of where and why IR and DL approaches perform better, respectively, and we also study hybrid approaches for assessing the effectiveness of integrating IR and DL. The study shows that the performance of IR approaches should not be underestimated, that while DL models perform better in predicting tokens from method signatures and capturing structural similarities in code, simple IR approaches tend to perform better in the presence of code with high similarity or long reference summaries, and that existing hybrid approaches do not perform as well as individual approaches in their respective areas of strength. Based on our findings, we discuss future research directions for better code summarization.

  • Conference Article
  • Cite Count: 1
  • 10.1109/i2ct.2017.8226300
Filtering of false positives from IR-based traceability links among software artifacts
  • Apr 1, 2017
  • Jyoti + 1 more

Correlation among software artifacts (also known as traceability links) of object-oriented software plays a vital role in its maintenance. These traceability links are commonly identified through Information Retrieval (IR) based techniques. However, the links produced by IR contain many false positives, and some complementary approaches have been suggested to address this. Still, the links usually require manual verification, which is neither desirable nor reliable. This paper suggests a new technique that automatically filters out the false-positive links (between requirements and source code) from IR and can thus help reduce both the dependence on and the incorrectness of the manual verification process. The proposed approach works by finding correlations among classes using structural dependency, co-change dependency, or both. A threshold is selected as a cut-off on the computed dependency values to accept the presence of structural and co-change dependency, respectively. The traceability links are then verified using these dependencies. If at least one of the structural or co-change information validates the link obtained from the IR approach, the link is selected as a candidate link; otherwise it is removed. Different thresholds were tried, and the results obtained from IR and from the proposed approach are compared. The results show that precision increases for all threshold values. Further analysis indicates that thresholds in the range of 0.3 to 0.5 give better results. Hence, the proposed approach can be used as a complement to other improved IR approaches to filter out false positives.
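
The filtering rule in the abstract above can be illustrated with a minimal Python sketch. The abstract does not say exactly how a dependency "validates" a link, so this sketch assumes one plausible reading: an IR link (requirement, class) is kept when the class has a structural or co-change dependency at or above its threshold with another class that IR linked to the same requirement. The function names, the dictionary layout, and the default thresholds of 0.4 are all hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]            # (requirement_id, class_name)
DepTable = Dict[Tuple[str, str], float]  # (class_a, class_b) -> dependency value


def dep(table: DepTable, a: str, b: str) -> float:
    """Look up a dependency value, treating the pair as unordered."""
    return max(table.get((a, b), 0.0), table.get((b, a), 0.0))


def filter_links(ir_links: List[Link], structural: DepTable, co_change: DepTable,
                 t_struct: float = 0.4, t_cochange: float = 0.4) -> List[Link]:
    """Keep an IR link only if structural OR co-change dependency supports it.

    Assumption (not from the paper): support means the class is coupled,
    above threshold, to another class linked to the same requirement.
    """
    classes_by_req: Dict[str, set] = {}
    for req, cls in ir_links:
        classes_by_req.setdefault(req, set()).add(cls)

    kept = []
    for req, cls in ir_links:
        peers = classes_by_req[req] - {cls}
        supported = any(
            dep(structural, cls, p) >= t_struct or dep(co_change, cls, p) >= t_cochange
            for p in peers
        )
        if supported:
            kept.append((req, cls))
    return kept


# Toy data: "Logger" is only weakly coupled to the other classes for R1.
ir_links = [("R1", "Order"), ("R1", "Invoice"), ("R1", "Logger")]
structural = {("Order", "Invoice"): 0.6}
co_change = {("Logger", "Order"): 0.1}
filtered = filter_links(ir_links, structural, co_change)
```

With the toy data, the `Logger` link is dropped because neither dependency reaches its threshold, while the mutually coupled `Order` and `Invoice` links survive.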

  • Research Article
  • 10.4172/2229-8711.1000178
Software Dependency Estimation in the code Repositories for the Requirement Evolution
  • Jan 1, 2015
  • Global Journal of Technology and Optimization
  • Karthikeyan Balasubramanian + 1 more

Dependency is the only means to ensure that the source code of a system is consistent with its requirements. During software maintenance and evolution, requirement dependency links become obsolete because the dependency model has not been trained properly to update them. Yet recovering these dependency links later is a daunting and costly task when building a model for unsupervised enhancements. Consequently, the literature has proposed methods, techniques, and tools to recover these dependency links semi-automatically or automatically. Among the proposed techniques, the literature has shown that information retrieval (IR) techniques can automatically recover traceability links between free-text requirements and source code by applying classification techniques to software repositories. However, IR techniques lack accuracy (precision and recall) in text- and concept-based mining, which also leads to code sense disambiguation. In this paper, we show that semantic mining of software repositories and combining the mined results with IR can improve the accuracy (precision and recall) of IR techniques. We apply Dependency Estimation to compare the accuracy of its dependency links with those recovered using state-of-the-art IR techniques based on the Vector Space Model and concept-based mining. We thus show that mining software repositories and combining the mined data with existing results from IR techniques improves the precision and recall of requirement dependency links.

  • Conference Article
  • Cite Count: 1
  • 10.1109/dcabes52998.2021.00043
Traceability method between design documents and source codes based on SQL dependency
  • Dec 1, 2021
  • Lujing Yu + 3 more

Information Retrieval (IR) technology is widely used for traceability between design documents and source code. However, the vocabulary mismatch between design documents and source code degrades the performance of IR. To address this, a dynamic tracing method from design documents to source code that combines IR technology with SQL statements is proposed for management information systems. First, the similarity between the two is calculated by IR and the candidate links are generated. Then, the SQL statements required by the code are automatically estimated from the design documents and compared with the actual SQL statements in the code to correct the document-code similarity score. Finally, a threshold is set to determine the trace links from design documents to source code. The experimental results show that this method can improve the similarity score of code classes with relevant SQL statements in the design documents, thereby improving the ranking of those classes among the candidate links, recovering trace links that the IR method may miss under the threshold, and ultimately improving the precision of the trace results.

  • Book Chapter
  • Cite Count: 6
  • 10.1007/978-981-10-5780-9_10
Requirements Traceability Through Information Retrieval Using Dynamic Integration of Structural and Co-change Coupling
  • Jan 1, 2017
  • Jyoti + 1 more

Requirement Traceability (RT) links correlate requirements to their corresponding source code and help with requirement understanding, reusability, and other software maintenance activities. Since a major portion of software artifacts is in the form of text, Information Retrieval (IR) techniques based on textual similarity are widely adopted for finding these links. But it is hard to find RT links when artifacts have little textual description. So, to find these links indirectly, non-textual techniques such as those based on structural information, co-change history, or ownership are used together with IR. However, if the results of IR contain false positives, the combined approach may increase them further. So, instead of combining them directly, this paper proposes an automatic technique for RT that first improves the IR approach and then combines it with the non-textual techniques. We also present a new non-textual technique based on weighted integration of structural coupling and change-history-based coupling of classes for retrieving indirect links. The results show that our proposed approach performs better than existing methods that use coupling information as a complement to IR.

  • Conference Article
  • Cite Count: 22
  • 10.1109/apsec.2003.1254359
Understanding how the requirements are implemented in source code
  • Dec 10, 2003
  • Wei Zhao + 4 more

For software maintenance and evolution, a common problem is to understand how each requirement is implemented in the source code. The basic solution to this problem is to find the fragment of source code that corresponds to the implementation of each requirement. This can be viewed as a requirement-slicing problem: slicing the source code according to each individual requirement. We present an approach to find the set of functions that corresponds to each requirement. The main idea of our method is to combine information retrieval technology with static analysis of source code structures. First, we retrieve the initial function sets through an information retrieval model, using functional requirements as the queries and the identifier information of functions in the source code (such as function names, parameter names, and variable names) as target documents. Then we complement each retrieved initial function set by analyzing the call graph extracted from the source code. A premise of our approach is that programmers use meaningful names as identifiers. Furthermore, we perform an experimental study based on a GNU system. We use two basic metrics, precision and recall (which are common practice in the information retrieval field), to evaluate our approach. We also compare the results acquired directly from information retrieval with those complemented through static source code structure analysis.
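
The two-stage idea in the abstract above (retrieve an initial function set from identifiers, then complement it from the call graph) can be sketched roughly as follows. The abstract does not name the retrieval model or the expansion rule, so this sketch substitutes simple term overlap for the IR model and a one-hop callee expansion for the call-graph analysis; both are assumptions, as are all function and variable names.

```python
import re
from typing import Dict, Iterable, List, Set


def terms(text: str) -> Set[str]:
    """Split prose, snake_case and camelCase identifiers into lowercase terms."""
    return {part.lower() for part in re.findall(r"[A-Za-z][a-z]*", text)}


def retrieve(requirement: str, identifiers: Dict[str, List[str]],
             min_overlap: int = 2) -> Set[str]:
    """Stage 1 (stand-in IR model): keep functions whose identifier terms
    share at least `min_overlap` terms with the requirement text."""
    query = terms(requirement)
    return {
        fn for fn, ids in identifiers.items()
        if len(query & terms(" ".join(ids))) >= min_overlap
    }


def complement(initial: Iterable[str], call_graph: Dict[str, List[str]]) -> Set[str]:
    """Stage 2 (assumed rule): add the direct callees of each retrieved function."""
    expanded = set(initial)
    for fn in list(expanded):
        expanded.update(call_graph.get(fn, []))
    return expanded


# Hypothetical system: identifiers per function and a tiny call graph.
identifiers = {
    "parse_config": ["parse_config", "path", "config"],
    "read_bytes": ["read_bytes", "fd", "buf"],
    "draw_menu": ["draw_menu", "items"],
}
call_graph = {"parse_config": ["read_bytes"]}

initial = retrieve("Parse the config file", identifiers)
functions = complement(initial, call_graph)
```

Here `read_bytes` shares no terms with the requirement and is missed by retrieval alone, but it is recovered because `parse_config` calls it. This also shows why the approach depends on meaningful identifier names.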

  • Research Article
  • Cite Count: 59
  • 10.3390/sym14030471
A Survey of Automatic Source Code Summarization
  • Feb 25, 2022
  • Symmetry
  • Chunyan Zhang + 6 more

Source code summarization refers to the natural language description of the source code’s function. It can help developers easily understand the semantics of the source code. We can think of the source code and the corresponding summarization as being symmetric. However, existing source code summaries are often mismatched with the source code, missing, or out of date. Manual source code summarization is inefficient and requires a lot of human effort. To overcome such situations, many studies have been conducted on Automatic Source Code Summarization (ASCS). Given source code, ASCS techniques can automatically generate a summary in natural language. In this paper, we review the development of ASCS technology. Almost all ASCS technology involves the following stages: source code modeling, code summarization generation, and quality evaluation. We further categorize the existing ASCS techniques based on the above stages and analyze their advantages and shortcomings. We also draw a clear map of the development of the existing algorithms.

  • Conference Article
  • Cite Count: 2
  • 10.1109/qrs57517.2022.00099
RetCom: Information Retrieval-Enhanced Automatic Source-Code Summarization
  • Dec 1, 2022
  • Yubo Zhang + 3 more

To save software engineers development time and improve their productivity, research on automated source-code summarization (SCS), i.e. generating natural language descriptions for source code, has become necessary in recent years. To date, there are two categories of SCS methods: information retrieval (IR)-based SCS and neural-based SCS. The latter is currently the mainstream method; however, this line of work is unable to generate low-frequency words, which potentially degrades performance. To tackle this problem, we propose an IR-enhanced neural SCS method, RetCom, which improves the prediction of low-frequency words by leveraging both structural-level and semantic-level code retrieval. Furthermore, we devise a token-level context-dependent mixture network to fuse different information sources, i.e. the original code, the structurally most similar code, and the semantically most similar code. Finally, extensive experiments are performed to validate RetCom on two real-world datasets. Compared to several baseline methods, the experimental results show that our method captures more low-frequency words and achieves superior performance.

  • Research Article
  • Cite Count: 1
  • 10.1587/transinf.e95.d.205
Feature Location in Source Code by Trace-Based Impact Analysis and Information Retrieval
  • Jan 1, 2012
  • IEICE Transactions on Information and Systems
  • Zhengong Cai + 3 more

Feature location is the task of identifying the source code that implements a given feature. It is essential for software maintenance and evolution. A large amount of research, including static analysis, dynamic analysis, and hybrid approaches, has been done on the feature location problem. The existing approaches either need plenty of scenarios or rely heavily on domain experts. This paper proposes a new approach to locate functional features in source code by combining change impact analysis and information retrieval. In this approach, the source code is instrumented and executed using a single scenario to obtain the execution trace. The execution trace is extended according to the control flow to cover all the potentially relevant classes. The classes are ranked by trace-based impact analysis and information retrieval. The ranking analysis takes advantage of the semantic and structural characteristics of source code. The identified results are of higher precision than those of the individual approaches. Finally, two open source cases have been studied and the efficiency of the proposed approach is verified.

  • Research Article
  • Cite Count: 4
  • 10.3897/biss.8.136735
Automating Information Retrieval from Biodiversity Literature Using Large Language Models: A Case Study
  • Sep 10, 2024
  • Biodiversity Information Science and Standards
  • Vamsi Krishna Kommineni + 3 more

Recently, Large Language Models (LLMs) have transformed information retrieval, becoming widely adopted across various domains due to their ability to process extensive textual data and generate diverse insights. Biodiversity literature, with its broad range of topics, is no exception to this trend (Boyko et al. 2023, Castro et al. 2024). LLMs can help in information extraction and synthesis, text annotation and classification, and many other natural language processing tasks. We leverage LLMs to automate the information retrieval task from biodiversity publications, building upon data sourced from our previous work (Ahmed et al. 2024). In our previous work (Ahmed et al. 2023, Ahmed et al. 2024), we assessed the reproducibility of deep learning (DL) methods used in biodiversity research. We developed a manual pipeline to extract key information on DL pipelines—dataset, source code, open-source frameworks, model architecture, hyperparameters, software and hardware specs, randomness, averaging result and evaluation metrics from 61 publications (Ahmed et al. 2024). While this allowed analysis, it required extensive manual effort by domain experts, limiting scalability. To address this, we propose an automatic information extraction pipeline using LLMs with the Retrieval Augmented Generation (RAG) technique. RAG combines the retrieval of relevant documents with the generative capabilities of LLMs to enhance the quality and relevance of the extracted information. We employed an open-source LLM, Hugging Face implementation of Mixtral 8x7B (Jiang et al. 2024), a mixture of expert models in our pipeline (Fig. 1) and adapted the RAG pipeline from earlier work (Kommineni et al. 2024). The pipeline was run on a single NVIDIA A100 40GB graphics processing unit with 4-bit quantization. To evaluate our pipeline, we compared the expert-assisted manual approach with the LLM-assisted automatic approach. 
We measured their consistency using the inter-annotator agreement (IAA) and quantified it with the Cohen Kappa score (Pedregosa et al. 2011), where a higher score indicates more reliable and aligned outputs (1: maximum agreement, -1: no agreement). The Kappa score among human experts (annotators 1 and 2) was 0.54 (moderate agreement), while the scores comparing human experts with the LLM were 0.16 and 0.12 (slight agreement). The difference is partly due to human annotators having access to more information (including code, dataset, figures, tables and supplementary materials) than the LLM, which was restricted to the text itself. Given these restrictions, the results are promising but also show the potential to improve them by adding further modalities to the LLM inputs. Future work will involve several key improvements to our LLM-assisted information retrieval pipeline: Incorporating multimodal data (e.g., figures, tables, code, etc.) as input to the LLM, alongside text, to enhance the accuracy and comprehensiveness of the information retrieved from publications. Optimizing the retrieval component of the RAG framework with advanced techniques like semantic search, hybrid search or relevance feedback can improve the quality of outputs. Expanding the evaluation to a larger corpus of biodiversity literature could provide a more comprehensive understanding of pipeline capabilities, and this paves the way for pipeline optimization. A human-in-the-loop approach for evaluating the LLM-generated outputs by matching the ground truth values from the respective publications, will increase the quality of the overall pipeline. Employing more metrics for the evaluation beyond the Cohen Kappa score to better understand the LLM-assisted outputs. Leveraging LLMs to automate information retrieval from biodiversity publications signifies a notable advancement in the scalable and efficient analysis of biodiversity literature. Initial results show promise, yet there is substantial potential for enhancement through the integration of multimodal data, optimized retrieval mechanisms, and comprehensive evaluation. By addressing these areas, we aim to improve the accuracy and utility of our pipeline, ultimately enabling broader and more in-depth analysis of biodiversity literature.
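
For readers unfamiliar with the agreement measure used in the abstract above, Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two annotators and p_e is the agreement expected by chance from their label frequencies. The abstract cites scikit-learn for its computation; the following is an independent pure-Python sketch for illustration only, with made-up labels, and the handling of the degenerate case p_e = 1 is an assumption of this sketch.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Assumption: both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up annotations of four items by a human expert and an automatic pipeline.
expert = ["yes", "yes", "no", "no"]
pipeline = ["yes", "no", "no", "no"]
kappa = cohen_kappa(expert, pipeline)
```

In this toy case the annotators agree on three of four items (p_o = 0.75), chance agreement is p_e = 0.5, and so κ = 0.5.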

  • Conference Article
  • Cite Count: 9
  • 10.1109/seaa.2013.65
Feature-to-Code Traceability in Legacy Software Variants
  • Sep 1, 2013
  • Hamzeh Eyal-Salman + 2 more

Existing similar software variants, developed by ad-hoc reuse techniques such as "clone-and-own", represent a starting point for building software product line (SPL) core assets. To re-engineer such legacy software variants into an SPL for systematic reuse, it is important to be able to identify a mapping between features and their implementing source code elements in different variants. Information Retrieval (IR) methods have been used widely to support this mapping in a single software product. This paper proposes a new approach to improve the performance of IR methods when they are applied to a collection of software variants. The novelty of our approach is twofold. On the one hand, it exploits what software variants have in common and how they differ to improve the accuracy of IR results. On the other hand, it reduces the abstraction gap between features and source code by introducing an intermediate level called "code-topic", which increases the number of relevant links retrieved. We have applied our approach to a collection of seven variants of a large-scale system by using the ArgoUML-SPL modeling tool. The experimental results showed that our approach outperforms conventional application of IR methods as well as the most recent and relevant work on the subject.

  • Conference Article
  • Cite Count: 63
  • 10.1145/2627508.2627514
Empirical studies on the NLP techniques for source code data preprocessing
  • May 26, 2014
  • Xiaobing Sun + 3 more

Program comprehension usually focuses on the significance of textual information to capture the programmers’ intent and knowledge in the software, in particular the source code. In the source code, most of the data is unstructured, such as the natural language text in comments and identifier names. Researchers in the software engineering community have developed many techniques for handling such unstructured data, such as natural language processing (NLP) and information retrieval (IR). Before using an IR technique on unstructured source code, we must preprocess the identifiers and comments, since this text differs from that used in daily life. During this process, several operations, e.g., tokenization, splitting, and stemming, are usually applied to the unstructured source code. These preprocessing operations affect the quality of the data used in the IR process. But how they affect the results of IR is still an open problem. To the best of our knowledge, there are still no studies focusing on this problem. This paper attempts to fill this gap and conducts empirical studies to show the differences before and after these preprocessing operations. The results show some interesting phenomena that depend on whether or not these preprocessing operations are used.

  • Research Article
  • 10.1038/s41598-025-19628-w
Cognitive-inspired xLSTM for multi-agent information retrieval
  • Oct 16, 2025
  • Scientific Reports
  • Li Liang + 2 more

In the era of big data and complex information retrieval tasks, multi-agent systems play a crucial role in handling large-scale, complex queries across various domains. Traditional approaches, such as BERT, RoBERTa, and Transformer models, have been widely used in information retrieval. However, these methods often suffer from computational inefficiencies and limited coordination between agents when dealing with long-term dependencies and collaborative tasks. These limitations lead to suboptimal retrieval accuracy and increased processing time, especially in multi-agent environments. To address these challenges, we propose a cognitive-inspired xLSTM model, specifically designed for multi-agent information retrieval. The model introduces advanced memory mechanisms, shared memory structures, and dynamic gating functions, enabling effective long-term dependency management and enhanced agent collaboration. The xLSTM allows agents to exchange information efficiently, optimizing both retrieval speed and accuracy. Extensive experiments on four benchmark datasets (HotpotQA, APPS, MBPP, and FEVER) demonstrate that xLSTM significantly outperforms six state-of-the-art methods in terms of training time, inference time, and key performance metrics such as accuracy, recall, and F1 score. The proposed method not only improves retrieval performance but also enhances computational efficiency, making it a valuable solution for real-time, large-scale information retrieval tasks.

  • Research Article
  • Cite Count: 7
  • 10.3390/e24101372
Re_Trans: Combined Retrieval and Transformer Model for Source Code Summarization
  • Sep 27, 2022
  • Entropy
  • Chunyan Zhang + 5 more

Source code summarization (SCS) is a natural language description of source code functionality. It can help developers understand programs and maintain software efficiently. Retrieval-based methods generate SCS by reorganizing terms selected from the source code or by reusing the SCS of similar code snippets. Generative methods generate SCS via an attentional encoder–decoder architecture. A generative method can generate SCS for any code, but its accuracy is sometimes still far from expectations (due to the lack of numerous high-quality training sets). A retrieval-based method is considered to have higher accuracy, but usually fails to generate SCS for source code when no similar candidate exists in the database. To effectively combine the advantages of retrieval-based and generative methods, we propose a new method: Re_Trans. For a given code, we first use the retrieval-based method to obtain its most semantically similar code and the corresponding SCS (S_RM). Then, we input the given code and the similar code into the trained discriminator. If the discriminator outputs one, we take S_RM as the result; otherwise, we use the generative model, a transformer, to generate the given code's SCS. In particular, we use AST-augmented (Abstract Syntax Tree) and code sequence-augmented information to make the source code semantic extraction more complete. Furthermore, we build a new SCS retrieval library from the public dataset. We evaluate our method on a dataset of 2.1 million Java code-comment pairs, and experimental results show improvement over the state-of-the-art (SOTA) benchmarks, which demonstrates the effectiveness and efficiency of our method.
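
The control flow described in the abstract above (retrieve the most similar code and its summary S_RM, ask a discriminator, and fall back to a generator) can be sketched as follows. This is only an illustration of the routing logic under stated assumptions: the real method uses semantic retrieval, a trained discriminator, and a transformer, whereas here token-level Jaccard similarity, a threshold rule, and a placeholder string stand in for them. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (a stand-in for semantic retrieval)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def summarize(code: str,
              library: List[Tuple[str, str]],
              discriminator: Callable[[str, str], int],
              generator: Callable[[str], str]) -> str:
    """Return the retrieved summary if the discriminator accepts it (outputs 1),
    otherwise fall back to the generative model. `library` must be non-empty."""
    similar_code, s_rm = max(library, key=lambda pair: jaccard(code, pair[0]))
    if discriminator(code, similar_code) == 1:
        return s_rm
    return generator(code)


# Toy components; each is a placeholder for a trained model.
library = [
    ("return a + b", "Adds two numbers."),
    ("return a * b", "Multiplies two numbers."),
]
toy_discriminator = lambda code, similar: 1 if jaccard(code, similar) >= 0.8 else 0
toy_generator = lambda code: "<generated summary>"

reused = summarize("return a + b", library, toy_discriminator, toy_generator)
generated = summarize("while True : pass", library, toy_discriminator, toy_generator)
```

The first call finds an exact match in the library and reuses its summary; the second finds nothing similar, so the discriminator rejects the candidate and the generator is used instead.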

  • Research Article
  • Cite Count: 97
  • 10.1145/2377656.2377660
Concept location using formal concept analysis and information retrieval
  • Nov 1, 2012
  • ACM Transactions on Software Engineering and Methodology
  • Denys Poshyvanyk + 2 more

The article addresses the problem of concept location in source code by proposing an approach that combines Formal Concept Analysis and Information Retrieval. In the proposed approach, Latent Semantic Indexing, an advanced Information Retrieval approach, is used to map textual descriptions of software features or bug reports to relevant parts of the source code, presented as a ranked list of source code elements. Given the ranked list, the approach selects the most relevant attributes from the best ranked documents, clusters the results, and presents them as a concept lattice, generated using Formal Concept Analysis. The approach is evaluated through a large case study on concept location in the source code on six open-source systems, using several hundred features and bugs. The empirical study focuses on the analysis of various configurations of the generated concept lattices and the results indicate that our approach is effective in organizing different concepts and their relationships present in the subset of the search results. In consequence, the proposed concept location method has been shown to outperform a standalone Information Retrieval based concept location technique by reducing the number of irrelevant search results across all the systems and lattice configurations evaluated, potentially reducing the programmers' effort during software maintenance tasks involving concept location.
