An empirical study on the challenges that developers encounter when developing Apache Spark applications

Zehao Wang, Tse-Hsun (Peter) Chen, Haoxiang Zhang, Shaowei Wang

doi:10.1016/j.jss.2022.111488

Abstract

Apache Spark is one of the most popular big data frameworks that abstract the underlying distributed computation details. However, even though Spark provides various abstractions, developers may still encounter challenges related to the peculiarities of distributed computation and the environment. To understand these challenges and provide insight for future studies, we conduct an empirical study of the questions that developers ask. We manually analyze 1,000 randomly selected questions that we collected from Stack Overflow. We find that: 1) Questions related to data processing (e.g., transforming data format) are the most common among the 11 types of questions that we uncovered. 2) Even though data processing questions are the most common, they require the least amount of time to receive an answer, whereas questions related to configuration and performance require the most time. 3) Most of the issues are caused by developers’ insufficient knowledge of API usage, data conversion across frameworks, and environment-related configurations. We also discuss the implications of our findings for researchers and practitioners. In summary, our work provides insights for future research directions and highlights the need for more software engineering research in this area.

Full Text