Abstract

Methods from supervised machine learning allow the classification of new data automatically and are tremendously helpful for data analysis. The quality of supervised machine learning depends not only on the type of algorithm used, but also on the quality of the labelled dataset used to train the classifier. Labelling instances in a training dataset is often done manually, relying on selections and annotations by expert analysts, and is often a tedious and time-consuming process. Active learning algorithms can automatically determine a subset of data instances for which labels would provide useful input to the learning process. Interactive visual labelling techniques are a promising alternative, providing effective visual overviews from which an analyst can simultaneously explore data records and select items to label. By putting the analyst in the loop, higher accuracy can be achieved in the resulting classifier. While initial results of interactive visual labelling techniques are promising in the sense that user labelling can improve supervised learning, many aspects of these techniques are still largely unexplored. This paper presents a study conducted using the mVis tool to compare three interactive visualisations (similarity map, scatterplot matrix (SPLOM), and parallel coordinates) with each other and with active learning for the purpose of labelling a multivariate dataset. The results show that all three interactive visual labelling techniques surpass active learning algorithms in terms of classifier accuracy, and that users subjectively prefer the similarity map over SPLOM and parallel coordinates for labelling. Users also employ different labelling strategies depending on the visualisation used.
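
To make the idea of an active learning algorithm selecting instances for labelling more concrete, the following is a minimal sketch of least-confidence sampling, one common query strategy. It is not necessarily a strategy used in the study; the function name `least_confident` and the example probabilities are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence


def least_confident(probabilities: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k instances whose most likely class has the lowest probability."""
    # The confidence of an instance is the probability of its top-ranked class.
    # Sorting ascending by that value puts the most uncertain instances first.
    ranked = sorted(probabilities, key=lambda instance_id: max(probabilities[instance_id]))
    return ranked[:k]


# Hypothetical class probabilities that some classifier predicted
# for four unlabelled instances (three classes each).
predicted = {
    "a": [0.95, 0.03, 0.02],
    "b": [0.40, 0.35, 0.25],
    "c": [0.60, 0.30, 0.10],
    "d": [0.34, 0.33, 0.33],
}

# Ask the analyst to label the two instances the classifier is least sure about.
queries = least_confident(predicted, k=2)
```

Here the instances `d` and `b` would be queried first, because their highest class probabilities (0.34 and 0.40) are the lowest of the four.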

Highlights

  • Labelling is assigning a class from the label alphabet to an instance in a multivariate dataset

  • Since prior studies have shown that users prefer t-SNE over PCA and MDS for interactive visual labelling, the t-SNE algorithm (Maaten and Hinton, 2008) is used for the similarity map

  • This paper presented a study comparing three interactive visualisations with each other and with active learning for the purpose of labelling a multivariate dataset

Summary

Introduction

3516-8685 © Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2020

Labelling is assigning a class from the label alphabet to an instance (a record) in a multivariate dataset. [...] on a labelled dataset in order to perform [...]. These methods learn how to generalise to new data, based on existing known data examples which are provided with a class label. Interactive visual labelling (VIAL) (Bernard et al, 2018c) tools build explorable visual overviews on top of active learning algorithms and can outperform classic active learning techniques in terms of accuracy (Bernard et al, 2018a). Such combined tools allow an analyst to label a multivariate dataset in a visual environment, while receiving feedback and guidance from the system. Thereby, users can gain an understanding of which choices affect the classifiers, and contribute to understandable and explainable machine learning models.
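
The definition of labelling, and the way a classifier generalises from labelled examples to new data, can be sketched in a few lines. This is only an illustration under stated assumptions: the label alphabet `{"cat", "dog"}`, the example records, and the helper names `assign_label` and `predict` are hypothetical, and the 1-nearest-neighbour rule stands in for whichever classifier is actually trained.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical label alphabet: the set of classes an analyst may assign.
LABEL_ALPHABET = {"cat", "dog"}

Record = Tuple[float, ...]


def assign_label(labels: Dict[int, str], index: int, label: str) -> None:
    """Label the record at `index`, accepting only classes from the label alphabet."""
    if label not in LABEL_ALPHABET:
        raise ValueError(f"{label!r} is not in the label alphabet")
    labels[index] = label


def predict(records: Sequence[Record], labels: Dict[int, str], query: Record) -> str:
    """Classify `query` with the label of its nearest labelled record (1-nearest-neighbour)."""
    nearest = min(labels, key=lambda i: math.dist(records[i], query))
    return labels[nearest]


# A small hypothetical multivariate dataset with two attributes per record.
records = [(1.0, 1.2), (0.9, 1.0), (5.0, 4.8), (5.2, 5.1)]

# The analyst labels only two of the four records.
labels: Dict[int, str] = {}
assign_label(labels, 0, "cat")
assign_label(labels, 2, "dog")

# The classifier generalises from the two labelled records to an unseen one.
prediction = predict(records, labels, (4.9, 5.0))
```

With only two labelled records, the unseen record `(4.9, 5.0)` is classified as `dog` because it lies closest to the labelled record `(5.0, 4.8)`. Which records the analyst chooses to label therefore directly shapes what the classifier predicts.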
