Abstract
With the rise of AI in software engineering (SE), researchers have shown how AI can be applied to assist software developers in a wide variety of activities. However, this rise has not been accompanied by a complementary increase in labelled datasets, which many supervised learning methods require. In recent years, several studies have used crowdsourcing platforms to collect labelled training data. However, research has shown that the quality of labelled data is unstable due to participant bias, variance in knowledge, and task difficulty. We therefore present CodeLabeller, a web-based tool that aims to handle the labelling of Java source files at scale more efficiently by improving the data collection process throughout, and to improve the reliability of responses by requiring each labeller to attach a confidence rating to each of their responses. We test CodeLabeller by constructing a corpus of over a thousand source files obtained from a large collection of open-source Java projects and labelling each Java source file with its respective design patterns and a summary. Apart from assisting researchers in crowdsourcing a labelled dataset, the tool has practical applicability in software engineering education and assists in building expert ratings for software artefacts. This paper discusses the motivation behind the creation of CodeLabeller, its intended users, a demonstration of the tool and its UI, its implementation, its benefits, and lastly its evaluation through a user study and in-practice usage.