Machine learning based Synthetic Data Generation using Iterative Regression Analysis

Sanskar Shah, Darshan Gandhi, Jil Kothari

doi:10.1109/iceca49313.2020.9297491

Abstract

Machine learning has had a dramatic impact on today’s world, and the field is advancing rapidly. However, some fields remain comparatively untouched by it. Sectors such as medicine and sports could benefit immensely from advances in machine learning, yet they lag behind for one crucial reason: the unavailability of data. With little data available for training, machine learning models become less accurate and therefore less reliable for real-time use. To counter this roadblock, this paper proposes generating synthetic data. As the name suggests, a synthetic dataset is a repository of data that is generated programmatically rather than collected through a real-life survey or experiment. Its primary purpose, therefore, is to be flexible and rich enough to help a machine learning practitioner conduct experiments with various classification, regression, and clustering algorithms. Following this approach, iterative regression analysis was applied to a dataset from the field of sports to generate synthetic data. The generated data was then combined with the original dataset to train a new model, which significantly increased the accuracy with which the model predicts features.
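The abstract does not spell out the procedure, so the following is only a minimal sketch of one way regression could be applied iteratively to synthesize rows, not the authors' implementation. It assumes a purely numeric table, resamples the first column from the observed values, and then regresses each later column on the columns before it, adding Gaussian noise scaled to the residual spread. The function name `generate_synthetic`, the column ordering, and the noise model are all hypothetical choices made for illustration.

```python
import numpy as np


def generate_synthetic(data, n_new, seed=0):
    """Hypothetical sketch: column j is regressed on columns 0..j-1."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape
    synthetic = np.empty((n_new, n_cols))

    # Assumption: seed the first feature by resampling observed values.
    synthetic[:, 0] = rng.choice(data[:, 0], size=n_new, replace=True)

    for j in range(1, n_cols):
        # Ordinary least squares with an intercept, fitted on the real data.
        X = np.column_stack([np.ones(n_rows), data[:, :j]])
        coef, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        resid_std = np.std(data[:, j] - X @ coef)

        # Predict column j from the synthetic columns generated so far,
        # then add noise so the new rows are not perfectly linear.
        X_new = np.column_stack([np.ones(n_new), synthetic[:, :j]])
        noise = rng.normal(0.0, resid_std, size=n_new)
        synthetic[:, j] = X_new @ coef + noise

    return synthetic


# Toy data standing in for a small real dataset (not from the paper).
toy_rng = np.random.default_rng(42)
x = toy_rng.normal(size=50)
original = np.column_stack([x, 2.0 * x + toy_rng.normal(scale=0.1, size=50)])

synthetic = generate_synthetic(original, n_new=200)
# Combine real and generated rows before training a downstream model.
augmented = np.vstack([original, synthetic])
```

In this sketch the augmented array plays the role of the combined training set described above; whether such augmentation improves accuracy depends on the data and has to be checked on held-out real samples.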

Full Text