Abstract

Machine learning (ML) is extensively used in production-ready applications, calling for mature engineering techniques that ensure robust development, deployment and maintenance. Given the potential negative impact ML can have on people, society or the environment, engineering techniques that ensure robustness against technical errors and adversarial attacks are of considerable importance. In this work, we investigate how teams of experts develop, deploy and maintain software with ML components. Moreover, we link what teams do to the effects they aim to achieve, and we provide means for improvement. Towards this goal, we performed a mixed-methods study with a sequential exploratory strategy. First, we performed a systematic literature review in which we mined both academic and grey literature and compiled a catalogue of engineering practices for ML. Second, we validated this catalogue using a large-scale survey that measured the degree of adoption of the practices and their perceived effects. Third, we ran validation interviews with practitioners to add depth to the survey results. The catalogue covers a broad range of practices for engineering software systems with ML components and for ensuring non-functional properties that fall under the umbrella of trustworthy ML, such as fairness, security or accountability. The results of our study indicate, for example, that larger and more experienced teams tend to adopt more practices, but that trustworthiness practices tend to be neglected. Moreover, we show that the effects measured in our survey, such as team agility or accountability, can be predicted quite accurately from groups of practices. This allowed us to contrast the importance of the practices for these effects with their adoption rates, revealing, for example, that some widely adopted practices are less important for certain effects: writing reusable scripts for data cleaning and merging, for instance, is highly adopted but has only a limited impact on reproducibility. Overall, our study provides a quantitative assessment of ML engineering practices and their impact on desirable properties of software with ML components, opening multiple avenues for improving the adoption of useful practices.

Editor's note: Open Science material was validated by the Journal of Systems and Software Open Science Board.
