With machine learning powered by big data and computing infrastructure, data-centric AI is driving a fundamental shift in the way software is developed. Software engineering must therefore be rethought so that data is treated as a first-class citizen on par with code. One striking observation is how much of the machine learning process is spent on data preparation. Without high-quality data, even the most powerful machine learning algorithms struggle to perform adequately. As a result, data-centric techniques are being adopted more widely. Unfortunately, many real-world datasets are small, dirty, biased, and sometimes even poisoned. In this study, we survey research on data collection and data quality for deep learning applications. Data collection is essential because modern deep learning algorithms rely more on large-scale data collection than on classification techniques. To improve data quality, we investigate data validation, cleaning, and integration techniques. Even when the data cannot be completely cleaned, robust model training strategies allow us to work with imperfect data during model training. Furthermore, although bias and fairness have received less attention in conventional data management research, they are significant themes in modern machine learning applications. We therefore investigate fairness measures and unfairness mitigation strategies that can be applied before, during, or after model training. We believe the data management community is in a good position to address these problems.