Abstract

In any data science and analytics project, the task of mapping a domain-specific problem to an adequate set of data mining methods by experts in the field is a crucial step. However, these experts are not always available, and data mining novices may be required to perform the task. While there are several research efforts for automated method selection as a means of support, only a few approaches consider the particularities of problems expressed in the natural and domain-specific language of the novice. This study proposes the design of an intelligent assistance system that takes problem descriptions articulated in natural language as an input and offers advice regarding the most suitable class of data mining methods. Following a design science research approach, the paper (i) outlines the problem setting with an exemplary scenario from industrial practice, (ii) derives design requirements, (iii) develops design principles and proposes design features, (iv) develops and implements the IT artifact using several methods such as embeddings, keyword extraction, topic models, and text classifiers, (v) demonstrates and evaluates the implemented prototype based on different classification pipelines, and (vi) discusses the results’ practical and theoretical contributions. The best performing classification pipelines show high accuracies when applied to validation data and are capable of creating a suitable mapping that exceeds the performance of joint novice assessments and simpler means of text mining. The research provides a promising foundation for further enhancements, either as a stand-alone intelligent assistance system or as an add-on to already existing data science and analytics platforms.

Highlights

  • Data science and analytics (DSA) projects are generally multidisciplinary and require combined expertise from several areas, such as profound domain knowledge, analytical modeling skills, and experience in collecting and processing data from heterogeneous IT systems (Mikalef and Krogstie 2019)

  • We collected 60 different real-world problem statements, which are distributed among the three target classes, based on problem descriptions derived from our own industrial DSA projects as well as selected data mining (DM) competitions from online platforms such as Kaggle

  • When gathering the set of problem statements, we took care to ensure (i) that the underlying scenarios originated from a wide range of application domains, (ii) that the keywords and key phrases signaling a specific class of DM method contained a sufficient degree of variability, and (iii) that the descriptions were provided with a varying degree of filler information and noise
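
To make the mapping task concrete, the following is a minimal sketch of a keyword-based baseline, in the spirit of the simpler text mining approaches that the proposed classification pipelines are compared against. It is not the implementation used in the study. The three class names, the cue words in `CUE_WORDS`, and the function `suggest_method_class` are illustrative assumptions, since the source does not name the target classes or their vocabulary here.

```python
import re
from collections import Counter

# Hypothetical cue words per method class. Both the class names and the
# keywords are assumptions for illustration, not the paper's taxonomy.
CUE_WORDS = {
    "classification": {"classify", "category", "categories", "detect", "label"},
    "regression": {"forecast", "estimate", "predict", "demand", "price"},
    "clustering": {"segment", "segments", "group", "groups", "similar"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_method_class(problem_statement):
    """Return (best_class_or_None, scores) based on cue-word counts."""
    counts = Counter(tokenize(problem_statement))
    scores = {
        label: sum(counts[word] for word in cues)
        for label, cues in CUE_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no cue word occurs at all, the baseline gives no advice.
    return (best if scores[best] > 0 else None), scores


if __name__ == "__main__":
    statement = (
        "We want to group our customers into segments "
        "with similar purchasing behaviour."
    )
    label, scores = suggest_method_class(statement)
    print(label, scores)
```

A baseline of this kind depends entirely on the chosen cue words, so it is sensitive to exactly the keyword variability and filler information that the problem statements were selected to contain.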


Summary

Introduction

Data science and analytics (DSA) projects are generally multidisciplinary and require combined expertise from several areas, such as profound domain knowledge, analytical modeling skills, and experience in collecting and processing data from heterogeneous IT systems (Mikalef and Krogstie 2019). Despite improved tool support, one crucial step remains challenging throughout the DSA implementation process: the mapping between (i) the problem space, expressed in the language and concepts of the domain-specific problem setting, and (ii) the class of generic data mining (DM) methods providing an algorithmic solution for data-driven decision support (Choinski and Chudziak 2009; Eckert and Ehmke 2017). This step requires a translation that determines the character of the subsequent DSA implementation process and, ultimately, the success of the whole project (Hogl 2003). The translation is carried out by well-trained DSA experts who bring the necessary skills to merge both contexts, that is, the methodical skills needed for a typical data lifecycle as well as the business understanding required to grasp the underlying problem characteristics and achieve the desired outcome towards economic goals (Debortoli et al. 2014; Schumann et al. 2016).

