Using Natural Language Processing to Extract Information from Unstructured code-change version control data: lessons learned

Elisabetta Ronchieri,Marco Canaparo,Yue Yang

doi:10.22323/1.378.0025

Abstract

Context: Natural Language Processing (NLP) is a branch of artificial intelligence that extracts information from language. In the field of software engineering, NLP has been employed to extract key information from free-form text, to generate models from the analysis of text or to categorize code changes according to their commit messages. In literature, most of the approaches NLP-based focused on the impact of code changes on program execution or software architecture. Objective: In this study, we have applied NLP to code-change data to identify patterns of software code modifications and used Machine Learning techniques to build a model that determines how software has evolved over time and identifies area of code that presents problems. Method: Considering that software projects use version control systems, such as github, to manage their code, we have collected software information by using git commands. These data contain different unstructured information about the various files in a project. Each modification entry includes a message that explains the reasons for the change. According to the content of the message, it is possible to identify key terms that can be used during the classification of the entries. Results: In this study, we have considered the change history of software available on github to the High Energy Physics community. With the use of NLP techniques we have cleaned the messages and extracted some key terms to categorize both software problems and some other changes performed by developers, like the addition of a third party dependency or a script that starts a given service. We have built a code change dictionary combining the terms already in existing literature with the ones gathered directly from the software and its github repository. Finally, we have applied some Machine Learning (ML) techniques to determine any connection between code changes and software problems: we have removed redundant entries to avoid any bias in the outcomes of the ML techniques. Conclusion: We show in detail our approach adopted to construct historical code change datasets of categorized commit messages by following a multi-label classification methodology. Our model performance seems promising in terms of accuracy, precision, recall and F1-score.

Full Text