Large Language Models (LLMs) have become instrumental in advancing software engineering (SE) tasks, demonstrating strong capabilities in code understanding and beyond. AI code models have proven valuable not only for code generation but also for defect detection, security enhancement, and overall software quality improvement, and they are emerging as crucial tools for both software development and software maintenance. As with traditional SE tools, open-source collaboration is key to building excellent products; for AI models, however, the essential ingredient is data. Collaboration on AI-based SE models hinges on maximising access to sources of high-quality data. Yet high-quality data often carries commercial or sensitive value, making it difficult to share with open-source AI-based SE projects. This reality presents a significant barrier to the development and enhancement of AI-based SE tools within the software engineering community. Researchers therefore need solutions that enable open-source AI-based SE models to tap into data resources held by different organizations. Addressing this challenge, this position paper investigates one solution for facilitating access to diverse organizational data for open-source AI models while respecting privacy and commercial sensitivities. We introduce a governance framework centered on federated learning (FL), designed to foster the joint development and maintenance of open-source AI code models while safeguarding data privacy and security. We also present guidelines for developers collaborating on AI-based SE tools, covering data requirements, model architecture, update strategies, and version control. Because data characteristics strongly influence federated learning, our study examines the effect of code data heterogeneity on FL performance, considering six data distribution scenarios, four code models, and four of the most common federated learning algorithms. Our experimental findings highlight the potential of federated learning for the collaborative development and maintenance of AI-based software engineering models. We also discuss the key issues to be addressed in this co-construction process and directions for future research.
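To make the collaboration mechanism concrete, the sketch below shows FedAvg-style aggregation, the weighted parameter averaging that underlies the most common federated learning algorithms. It is an illustration under assumed names (`fedavg_aggregate`, PyTorch state dicts) and does not reproduce the paper's actual algorithms, code models, or data splits.

```python
# Minimal FedAvg-style aggregation sketch (illustrative, not the paper's method).
# Each organization trains locally on its private code data and shares only
# model parameters; a coordinator merges them into the shared global model.
from typing import Dict, List
import torch


def fedavg_aggregate(client_states: List[Dict[str, torch.Tensor]],
                     client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Average client model parameters, weighted by local dataset size."""
    total = sum(client_sizes)
    global_state: Dict[str, torch.Tensor] = {}
    for name in client_states[0]:
        global_state[name] = sum(
            state[name] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state


# Usage (hypothetical): merge one round of client updates into the global model.
# global_model.load_state_dict(fedavg_aggregate(client_states, client_sizes))
```

Because only parameters (not code data) leave each organization, this round-based exchange is what allows jointly maintained open-source code models without exposing commercially sensitive datasets.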