Although there is a rich literature on the data lifecycle, a common framework depicts reuse as its last stage. This framework fails to explain the complex lifecycle of machine learning (ML) data sets, which can have many different afterlives. Data sets for ML can be expanded to supplement previous research, and researchers can concatenate multiple data sets to develop new models. This study discusses ML dataset reuse through the lens of the data–information–knowledge–wisdom pyramid. In social science research, researchers might reuse data to analyse a new research question that remains within the original data domain. By contrast, in ML research practice, researchers layer multiple data sets for training purposes, which raises the question of whether the existing data lifecycle model, ending with ‘reuse’, is appropriate for explaining such an iterative and layered lifecycle. This study introduces one case of merging a computer vision data set with a natural language processing data set, and two cases of applying ML models from outside of the ML community (hate speech detection and politeness detection), to justify a framework for an ML dataset lifecycle. Finally, this study proposes an ML dataset lifecycle and provides case examples to describe each stage.