AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects
Traceability plays a vital role in facilitating various software development activities by establishing the traces between different types of artifacts (e.g., issues and commits in software repositories). Among the explorations for automated traceability recovery, the IR (Information Retrieval)-based approaches leverage textual similarity to measure the likelihood of traces between artifacts and show advantages in many scenarios. However, the globalization of software development has introduced new challenges, such as the same concept being expressed in more than one language (e.g., "[SEE PDF]" vs. "attribute") in the artifact texts, which significantly hampers the performance of IR-based approaches. Existing research has shown that machine translation can help address the term inconsistency in bilingual projects. However, the translation can also bring in synonymous terms that are not consistent with those in the bilingual projects (e.g., another translation of "[SEE PDF]" as "property"). Therefore, we propose an enhancement strategy called AVIATE that exploits translation variants from different translators by utilizing the word pairs that appear simultaneously across the translation variants from different kinds of artifacts (a.k.a. consensual biterms). We use these biterms to first enrich the artifact texts, and then to enhance the calculated IR values for improving IR-based traceability recovery for bilingual software projects. The experiments on 17 bilingual projects (involving English and 4 other languages) demonstrate that AVIATE significantly outperformed the IR-based approach with machine translation (the state-of-the-art in this field) with an average increase of 16.67 (31.43%) in Average Precision and 8.38 (11.22%) in Mean Average Precision, indicating its effectiveness in addressing the challenges of multilingual traceability recovery.
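The idea of consensual biterms can be illustrated with a minimal sketch. It is not the AVIATE implementation: the abstract does not say how word pairs are extracted, so the sketch assumes a biterm is any unordered pair of distinct tokens in a text, and that a biterm is consensual when it occurs in every translation variant of both the issue and the commit. The function names and the sample texts are hypothetical.

```python
from itertools import combinations


def biterms(text):
    """Return all unordered pairs of distinct lowercase tokens in a text.

    Assumption: simple whitespace tokenization and co-occurrence in the
    same text stand in for whatever extraction the paper actually uses.
    """
    tokens = sorted(set(text.lower().split()))
    return set(combinations(tokens, 2))


def consensual_biterms(issue_variants, commit_variants):
    """Keep only biterms found in every translation variant of both artifacts."""
    sets = [biterms(t) for t in issue_variants + commit_variants]
    return set.intersection(*sets)


def enrich(text, shared):
    """Append each consensual biterm to the text as an extra joined token."""
    extra = " ".join(f"{a}_{b}" for a, b in sorted(shared))
    return f"{text} {extra}".strip()


# Two hypothetical translators disagree on "attribute" vs. "property".
issue_variants = ["add attribute parser", "add property parser"]
commit_variants = ["fix attribute parser add test", "fix property parser add test"]

shared = consensual_biterms(issue_variants, commit_variants)
enriched_issue = enrich(issue_variants[0], shared)
```

In this toy input the translators disagree on one term, so only the pair that survives in all four texts is kept and used to enrich the issue text before similarity is computed.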
- Research Article
14
- 10.1016/j.infsof.2013.08.004
- Sep 10, 2013
- Information and Software Technology
Enhancing software artefact traceability recovery processes with link count information
- Conference Article
61
- 10.1109/icpc.2009.5090038
- May 1, 2009
The intensive human effort needed to manually manage traceability information has increased the interest in utilising semi-automated traceability recovery techniques. This paper presents a simple way to improve the accuracy of traceability recovery methods based on Information Retrieval techniques. The proposed method acts on the artefact indexing considering only the nouns contained in the artefact content to define the semantics of an artefact. The rationale behind such a choice is that the language used in software documents can be classified as a sectorial language, where the terms that provide more indication on the semantics of a document are the nouns. The results of a reported case study demonstrate that the proposed artefact indexing significantly improves the accuracy of traceability recovery methods based on the probabilistic or vector space based IR models.
- Conference Article
11
- 10.1049/ic.2012.0014
- Jan 1, 2012
Background: Development of complex, software intensive systems generates large amounts of information. Several researchers have developed tools implementing information retrieval (IR) approaches to suggest traceability links among artifacts. Aim: We explore the consequences of the fact that a majority of the evaluations of such tools have been focused on benchmarking of mere tool output. Method: To illustrate this issue, we have adapted a framework of general IR evaluations to a context taxonomy specifically for IR-based traceability recovery. Furthermore, we evaluate a previously proposed experimental framework by conducting a study using two publicly available tools on two datasets originating from development of embedded software systems. Results: Our study shows that even though both datasets contain software artifacts from embedded development, the characteristics of the two datasets differ considerably, and consequently so do the traceability outcomes. Conclusions: To enable replications and secondary studies, we suggest that datasets should be thoroughly characterized in future studies on traceability recovery, especially when they cannot be disclosed. Also, while we conclude that the experimental framework provides useful support, we argue that our proposed context taxonomy is a useful complement. Finally, we discuss how empirical evidence of the feasibility of IR-based traceability recovery can be strengthened in future research.
- Conference Article
12
- 10.1109/icsm.2009.5306317
- Sep 1, 2009
This paper presents a two-step process aimed at improving the tracing performance of the software engineer when using an IR-based traceability recovery tool. In the first step, the software engineer performs an incremental coarse-grained traceability recovery between a set of source artefacts and a set of target artefacts. During this step, he/she traces as many links as possible while keeping the effort to discard false positives low. In the second step, he/she uses a coverage link analysis to identify poorly traced source artefacts and to guide focused fine-grained traceability recovery sessions that recover links missed in the first step. The results achieved in a reported controlled experiment demonstrate that the proposed approach significantly increases the number of correct links traced by the software engineer with respect to a traditional process.
- Research Article
23
- 10.1016/j.infsof.2012.08.002
- Aug 24, 2012
- Information and Software Technology
Applying a smoothing filter to improve IR-based traceability recovery processes: An empirical investigation
- Conference Article
3
- 10.1109/csmr.2011.54
- Mar 1, 2011
Modern large-scale software development is a complex undertaking and coordinating various processes is crucial to achieve efficiency. The alignment between requirements and test activities is one very important aspect. Production and maintenance of software result in an ever-increasing amount of information. To be able to work efficiently under such circumstances, navigation in all available data needs support. Maintaining traceability links between software artifacts is one approach to structure the information space and support this challenge. Many researchers have proposed traceability recovery by applying information retrieval (IR) methods, utilizing the fact that artifacts often have textual content in natural language. Case studies have shown promising results, but no large-scale in vivo evaluations have been made. Currently, there is a trend among our industrial partners to move to a specific new software engineering tool. Their aim is to collect different pieces of information in one system. Our ambition is to develop an IR-based traceability recovery plug-in to this tool. From this position, right in the middle of a real industrial setting, many interesting observations could be made. This would allow a unique evaluation of the usefulness of the IR-based approach.
- Conference Article
1
- 10.14236/ewic/fdia2011.3
- Aug 1, 2011
- Electronic workshops in computing
Large-scale software development is a complex undertaking and generates an ever-increasing amount of information. To be able to work efficiently under such circumstances, navigation in all available data needs support. Maintaining traceability links between software artefacts is one approach to structure the information space and support this challenge. Several researchers have proposed traceability recovery by applying IR methods, based on textual similarities between artefacts. Early studies have shown promising results, but no large-scale in vivo evaluations have been made. Currently, there is a trend among our industrial partners to collect artefacts in a specific new software engineering tool. Our goal is to develop an IR-based traceability recovery plugin to this tool. From this position, in the environment of possible future users, the usefulness of supported findability in a software engineering context could be explored with an industrial validity.
- Conference Article
13
- 10.1109/icpc.2019.00055
- May 1, 2019
Traceability recovery allows developers to extract and comprehend the trace links among software artifacts (e.g., requirements and code). These trace links can provide important support to software maintenance and evolution tasks. Information Retrieval (IR) is now widely accepted as the key technique of semi-automatic tools to recover candidate trace links based on textual similarities among artifacts. However, the vocabulary mismatch problem between different artifacts hinders the performance of these IR-based approaches. Thus, a growing body of enhancing strategies has been proposed based on user feedback. They allow the textual similarities of candidate links to be adjusted after users accept or reject part of these links. Recently, several approaches successfully used this strategy to improve the performance of IR-based traceability recovery. However, these approaches require a large amount of user feedback, which is infeasible in practice. In this paper, we propose to improve IR-based traceability recovery by introducing only a small amount of user feedback into the closeness analysis on call and data dependencies in code. Specifically, our approach iteratively asks users to verify a chosen candidate link based on the quantified functional similarity for each code dependency (called closeness) and the generated IR values. The verified link is then used as the input to re-rank the unverified candidate links. An empirical evaluation based on five real-world systems shows that our approach can outperform four baseline approaches by using only a small amount of user feedback.
- Book Chapter
89
- 10.1007/978-1-4471-2239-5_4
- Oct 31, 2011
The potential benefits of traceability are well known and documented, as well as the impracticability of recovering and maintaining traceability links manually. Indeed, the manual management of traceability information is an error-prone and time-consuming task. Consequently, despite the advantages that can be gained, explicit traceability is rarely established unless there is a regulatory reason for doing so. Extensive efforts have been brought forth to improve the explicit connection of software artifacts in the software engineering community (both research and commercial). Promising results have been achieved using Information Retrieval (IR) techniques for traceability recovery. IR-based traceability recovery methods propose a list of candidate traceability links based on the similarity between the text contained in the software artifacts. Software artifacts have different structures and the common element among many of them is the textual data, which most often captures the informal semantics of artifacts. For example, source code includes a large volume of textual data in the form of comments and identifiers. In consequence, IR-based approaches are very well suited to address the traceability recovery problem. The conjecture is that artifacts with high textual similarity are good candidates to be traced to each other since they share several concepts. In this chapter we overview a general process of using IR-based methods for traceability link recovery and cover some of them in greater detail: probabilistic, vector space, and Latent Semantic Indexing models. Finally, we discuss common approaches to measuring the performance of IR-based traceability recovery methods and the latest advances in techniques for the analysis of candidate links.
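The general process described in this abstract, ranking candidate links by textual similarity, can be sketched with a small vector space model. This is an illustrative sketch only: whitespace tokenization, the smoothed TF-IDF weighting, and the sample artifact texts are all assumptions, and no stop-word removal or stemming is applied.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of texts."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumption: idf = log(n / df) + 1 so shared terms keep a nonzero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(source, targets):
    """Return (target index, similarity) pairs, most similar first."""
    vecs = tfidf_vectors([source] + targets)
    scores = [(i, cosine(vecs[0], v)) for i, v in enumerate(vecs[1:])]
    return sorted(scores, key=lambda p: p[1], reverse=True)


# Hypothetical requirement and two candidate code artifacts.
ranked = rank_candidates(
    "user login password",
    ["validate login password for user", "render chart colors"],
)
```

The ranked list is what a tool would present to the engineer as candidate links; a target that shares no terms with the source gets a similarity of zero and falls to the bottom.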
- Conference Article
87
- 10.1109/icsm.2006.32
- Sep 1, 2006
- Proceedings/Proceedings - Conference on Software Maintenance
Several authors apply Information Retrieval (IR) techniques to recover traceability links between software artefacts. Recently, the use of user feedback (in terms of classification of retrieved links as correct or false positives) has been proposed to improve the retrieval performance of these techniques. In this paper we present a critical analysis of using feedback within an incremental traceability recovery process. In particular, we analyse the trade-off between the improvement in performance and the link classification effort required to train the IR-based traceability recovery tool. We also present the results achieved in case studies and show that even though the retrieval performance generally improves with the use of feedback, IR-based approaches are still far from solving the problem of recovering all correct links with a low classification effort.
- Conference Article
48
- 10.1109/icpc.2011.34
- Jun 1, 2011
Information Retrieval methods have been largely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect the recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performances of traceability recovery methods. An empirical evaluation performed on two repositories indicates that the usage of a smoothing filter is able to significantly improve the performances of Vector Space Model and Latent Semantic Indexing. Such a result suggests that other than being used for traceability recovery the proposed filter can be used to improve performances of various other software engineering approaches based on textual analysis.
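The abstract does not spell out how the smoothing filter works. As a hedged illustration, the sketch below assumes one simple realization: treat the mean term-weight vector of a set of artefacts as the noise shared by all of them, and subtract it from each artefact vector before similarity is computed. The function names and the sample weights are hypothetical.

```python
def mean_vector(vectors):
    """Average weight of every term across a set of sparse vectors."""
    n = len(vectors)
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / n for t in terms}


def smooth(vectors):
    """Subtract the mean vector from each artefact vector.

    Assumption: terms used uniformly by all artefacts of one kind
    (e.g. "shall" in requirements) carry little discriminating information.
    """
    mean = mean_vector(vectors)
    return [{t: v.get(t, 0.0) - mean[t] for t in mean} for v in vectors]


# Two hypothetical requirement vectors that both contain "shall".
requirements = [
    {"shall": 1.0, "login": 1.0},
    {"shall": 1.0, "report": 1.0},
]
smoothed = smooth(requirements)
```

After filtering, the term shared by every artefact has zero weight, while the distinguishing terms keep opposite signs across the two vectors.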
- Research Article
15
- 10.5555/2821445.2821449
- May 16, 2015
Traceability recovery allows software engineers to understand the interconnections among software artefacts and, thus, it provides important support to software maintenance activities. In the last decade, Information Retrieval (IR) has been widely adopted as the core technology of semi-automatic tools to extract traceability links between artefacts according to their textual information. However, a widely known problem of IR-based methods is that some artefacts may share more words with non-related artefacts than with related ones. To overcome this problem, enhancing strategies have been proposed in the literature. One of these strategies is relevance feedback, which allows the textual similarity to be modified according to information about links classified by the users. Even though this technique is widely used for natural language documents, previous work has demonstrated that relevance feedback is not always useful for software artefacts. In this paper, we propose an adaptive version of relevance feedback that, unlike the standard version, considers the characteristics of both (i) the software artefacts and (ii) the previously classified links for deciding whether and how to apply the feedback. An empirical evaluation conducted on three systems suggests that the adaptive relevance feedback outperforms both a pure IR-based method and the standard feedback.
- Conference Article
17
- 10.1109/sst.2015.10
- May 1, 2015
Traceability recovery allows software engineers to understand the interconnections among software artefacts and, thus, it provides important support to software maintenance activities. In the last decade, Information Retrieval (IR) has been widely adopted as the core technology of semi-automatic tools to extract traceability links between artefacts according to their textual information. However, a widely known problem of IR-based methods is that some artefacts may share more words with non-related artefacts than with related ones. To overcome this problem, enhancing strategies have been proposed in the literature. One of these strategies is relevance feedback, which allows the textual similarity to be modified according to information about links classified by the users. Even though this technique is widely used for natural language documents, previous work has demonstrated that relevance feedback is not always useful for software artefacts. In this paper, we propose an adaptive version of relevance feedback that, unlike the standard version, considers the characteristics of both (i) the software artefacts and (ii) the previously classified links for deciding whether and how to apply the feedback. An empirical evaluation conducted on three systems suggests that the adaptive relevance feedback outperforms both a pure IR-based method and the standard feedback.
- Conference Article
78
- 10.1109/csmr.2013.29
- Mar 1, 2013
Information Retrieval (IR) has been widely accepted as a method for automated traceability recovery based on the textual similarity among the software artifacts. However, a notorious difficulty for IR-based methods is that artifacts may be related even if they are not textually similar. A growing body of work addresses this challenge by combining IR-based methods with structural information from source code. Unfortunately, the accuracy of such methods is highly dependent on the IR methods. If the IR methods perform poorly, the combined approaches may perform even worse. In this paper, we propose to use the feedback provided by the software engineer when classifying candidate links to regulate the effect of using structural information. Specifically, our approach only considers structural information when the traceability links from the IR methods are verified by the software engineer and classified as correct links. An empirical evaluation conducted on three systems suggests that our approach outperforms both a pure IR-based method and a simple approach for combining textual and structural information.
- Research Article
85
- 10.1007/s10664-008-9090-8
- Nov 7, 2008
- Empirical Software Engineering
We report the results of a controlled experiment and a replication performed with different subjects, in which we assessed the usefulness of an Information Retrieval-based traceability recovery too...