Abstract

Machine learning teaches computers to learn from data in a way loosely analogous to how humans do. ML models work by exploring data and identifying patterns with minimal human intervention. A supervised ML model learns to map an input to an output from labeled examples of input-output (X, y) pairs, whereas an unsupervised ML model discovers previously undetected patterns and information in unlabeled data. Because an ML project is a highly iterative process, there is always a need to change the ML code/model and the datasets. When an ML model achieves 70-75% accuracy, the code or algorithm most probably works fine. Nevertheless, in many applications, e.g., medical diagnosis or spam detection, 75% accuracy is too low to deploy in production. A medical model used in sensitive tasks such as detecting certain diseases must have an accuracy level of 98-99%, and achieving that is a big challenge. In that scenario, we may already have a well-working model, so a model-centric approach may not help much in reaching the desired accuracy threshold; improving the dataset, however, will improve the overall performance of the model. Improving the dataset does not always mean bringing more and more data into it. Improving the quality of the data, by establishing a reasonable baseline level of performance, ensuring labeler consistency, performing error analysis, and auditing performance, will substantially improve the model's accuracy. This review paper focuses on the data-centric approach to improving the performance of a production machine learning model.
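The supervised setting described above, learning a mapping from inputs X to labels y from labeled (X, y) pairs, can be sketched with a minimal 1-nearest-neighbor classifier. This is a toy illustration, not the paper's method; the data points and labels below are invented:

```python
# Toy supervised learner: 1-nearest-neighbor over labeled (X, y) pairs.
# The training data below is invented purely for illustration.

def predict(X_train, y_train, x):
    """Return the label of the training point closest to x (squared distance)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(X_train)), key=lambda i: dist(X_train[i], x))
    return y_train[nearest]

X_train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y_train = ["negative", "negative", "positive", "positive"]

print(predict(X_train, y_train, (0.95, 1.0)))  # -> positive
```

The point of the sketch is that a supervised model is entirely defined by its labeled examples, which is exactly why data quality dominates once the algorithm itself is sound.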

Highlights

  • In academic and research settings, traditional ML modelling is less complicated

  • When a dataset has substantial room for improvement, improving data quality will improve the overall accuracy of the machine learning model

  • There can be a big gap between a model's accuracy and human-level performance (HLP)
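The gap in the last highlight can be made concrete. Treating human-level error as a proxy for Bayes error (as the paper's later section on that topic does), the gap between human and training error is avoidable bias, and the gap between training and dev error is variance. The error rates below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Using human-level performance (HLP) as a proxy for Bayes error.
# All error rates below are hypothetical, for illustration only.

human_error = 0.01   # ~99% human accuracy on the task
train_error = 0.08   # model error on the training set
dev_error   = 0.10   # model error on the held-out dev set

avoidable_bias = round(train_error - human_error, 4)  # gap the model could still close
variance       = round(dev_error - train_error, 4)    # gap from imperfect generalization

# If avoidable bias dominates, work on the model/training; if variance
# dominates, work on more or better-quality data and regularization.
focus = "bias" if avoidable_bias > variance else "variance"
print(avoidable_bias, variance, focus)  # -> 0.07 0.02 bias
```

With these illustrative numbers the bias gap (0.07) dominates the variance gap (0.02), so effort would go into the model first; the reverse pattern is the case where the data-centric approach pays off most.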



INTRODUCTION

In academic and research settings, traditional ML modelling is less complicated. Typically, standard datasets are supplied, and they are most often already cleaned and labeled. After a certain level of accuracy is achieved, more data alone does not help much to improve performance further. To go beyond that point, we need a consistent and correctly labeled dataset, i.e., good data, together with a state-of-the-art model. While data is collected from various sources using a data pipeline, it needs to go through extensive data cleaning and formatting processes. These processes can filter and clean the data to a certain level, making it suitable for feeding a machine learning model. With such data we can build a model and typically get around 60-70% accuracy.

A Data-centric Approach to Improve Machine Learning Model's Performance in Production
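One way to establish the baseline that the later "ESTABLISHING BASELINE" section refers to is to compare the model against a trivial majority-class predictor: if the model barely beats it, the reported accuracy is not meaningful. A minimal sketch, with an invented spam/ham label set:

```python
from collections import Counter

# Majority-class baseline: always predict the most common training label.
# The labels below are invented for illustration.

def majority_baseline(y_train):
    """Return the most frequent label in the training set."""
    return Counter(y_train).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_train = ["spam", "ham", "ham", "ham", "spam", "ham"]
y_test  = ["ham", "spam", "ham", "ham"]

baseline_label = majority_baseline(y_train)      # "ham"
baseline_preds = [baseline_label] * len(y_test)
print(accuracy(y_test, baseline_preds))          # -> 0.75
```

Here the do-nothing baseline already scores 0.75 on the (invented) test labels, so a model at the 60-70% accuracy the text mentions would actually be underperforming the baseline, which is exactly the kind of insight a baseline check provides.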

LABELER CONSISTENCY
ESTABLISHING BASELINE
ERROR ANALYSIS
DATA AUGMENTATION
HUMAN-LEVEL ERROR AS A PROXY FOR BAYES ERROR
PERFORMANCE AUDITING
Findings
VIII. CONCLUSION