Abstract

One major challenge in machine learning applications is coping with mismatches between the datasets used during development and those encountered in real-world operation. Such mismatches can lead to inaccurate predictions and errors, resulting in poor product quality and unreliable systems. To address these mismatches, it is important to understand in what sense the two datasets differ. In this study, we propose StyleDiff, which informs developers of the types of mismatches between two datasets in an unsupervised manner. Given two unlabeled image datasets, StyleDiff automatically extracts latent attributes that are distributed differently between them and visualizes the differences in a human-understandable form. For example, in an object detection dataset, latent attributes might include properties that are not explicitly labeled, such as the time of day, weather, and traffic congestion in an image. StyleDiff helps developers understand how the datasets differ with respect to such latent attribute distributions. Developers can then, for instance, collect additional development data covering these attributes or conduct additional tests targeting them to enhance reliability. We demonstrate, using driving scene datasets among others, that StyleDiff accurately detects differences between datasets and presents them in an understandable format.
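
To make the idea concrete, the following is a minimal, hypothetical sketch of the kind of analysis the abstract describes: given latent codes for two unlabeled datasets (e.g., produced by a pretrained image encoder), it ranks latent dimensions by how differently their marginal distributions behave across the two datasets. The function name, the per-dimension comparison, and the choice of the Wasserstein distance are illustrative assumptions for this sketch, not StyleDiff's actual procedure.

    # Hypothetical sketch; not StyleDiff's actual method.
    import numpy as np
    from scipy.stats import wasserstein_distance

    def rank_divergent_attributes(latents_a, latents_b, top_k=5):
        """Rank latent dimensions by how differently their 1-D marginal
        distributions are spread across the two datasets."""
        assert latents_a.shape[1] == latents_b.shape[1]
        scores = np.array([
            wasserstein_distance(latents_a[:, d], latents_b[:, d])
            for d in range(latents_a.shape[1])
        ])
        order = np.argsort(scores)[::-1][:top_k]
        return [(int(d), float(scores[d])) for d in order]

    # Toy usage: dataset B is shifted along dimension 3, simulating an
    # unlabeled attribute (e.g., time of day) that differs between datasets.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(1000, 16))
    b = rng.normal(size=(1000, 16))
    b[:, 3] += 2.0
    print(rank_divergent_attributes(a, b))  # dimension 3 ranks first

In a workflow like the one the abstract outlines, the top-ranked dimensions would then be visualized, for instance by showing images that vary along each dimension, so that a developer can recognize and name the underlying attribute.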
