In industrial settings, measuring the quality of the data used to represent an intended domain of use and its operating conditions is both crucial and challenging. This paper presents a set of metrics addressing this data quality issue in the form of a library, named DQM (Data Quality Metrics), for Machine Learning (ML) use. The library also includes additional metrics developed specifically for industrial applications, and it aims to assess various types of data and datasets. These metrics are used to characterize the training and evaluation datasets involved in building ML models for industrial use cases. Two categories of metrics are implemented in DQM: inherent data metrics, which evaluate the quality of a given dataset independently of the ML model (for example, its statistical properties and attributes), and model-dependent metrics, which measure the quality of a dataset by considering the ML model's outputs (for example, the gap between two datasets with respect to a given ML model). DQM is used within the Confiance.ai program to evaluate datasets used for industrial purposes such as autonomous driving.