Abstract

Computer vision, enabled by artificial intelligence and deep learning, has a nearly limitless range of possible applications, both military and civilian. Object detection is a particularly notable computer vision task, with broad usefulness in systems such as autonomous vehicles, robotics, and security. Developing effective object detection methods faces many challenges; one significant challenge is the lack of good labeled data for the target domain, as hand-labeled data is time-consuming and expensive to produce. Synthetic data generation seeks to solve this problem by programmatically generating training data and labels simultaneously, allowing the creation of arbitrarily large training datasets. However, synthetic data has several drawbacks: generating realistic imagery is challenging and computationally expensive, and models trained on synthetic data frequently lose accuracy when applied to real test data. In our research, we use model explainability techniques to connect model predictions back to the model's training data, in order to identify the most important features that must be represented accurately in synthetic training data. Influence functions score training samples by how influential each sample was to a particular prediction, approximating the effect of retraining the model with that individual sample left out of the training set. In this work, we seek to extend influence functions to identify the most valuable features in real and synthetic training data, for use in improving our synthetic data generation tools.
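
To make the influence-function idea concrete, the sketch below scores training samples for a toy model using the standard formulation of Koh and Liang (2017), I(z, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z), where H is the Hessian of the training loss at the fitted parameters. This is an illustrative assumption on our part, not the paper's implementation: the toy logistic-regression model, the data, and the exact Hessian inverse are stand-ins, and a real object detection model would require Hessian-inverse approximations such as conjugate gradient or LiSSA.

    # Minimal influence-function sketch (hypothetical, not the authors' code).
    # Influence of training point z on test point z_test:
    #   I(z, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z)
    import torch

    torch.manual_seed(0)

    # Toy data: 20 training points, 2 features, binary labels (illustrative only).
    X_train = torch.randn(20, 2)
    y_train = (X_train.sum(dim=1) > 0).float()
    x_test = torch.randn(2)
    y_test = torch.tensor(1.0)

    theta = torch.zeros(3, requires_grad=True)  # weights (2) + bias (1)

    def loss_fn(theta, x, y):
        logits = x @ theta[:2] + theta[2]
        return torch.nn.functional.binary_cross_entropy_with_logits(logits, y)

    # Fit the model with plain gradient descent (stand-in for full training).
    opt = torch.optim.SGD([theta], lr=0.5)
    for _ in range(500):
        opt.zero_grad()
        loss_fn(theta, X_train, y_train).backward()
        opt.step()

    # Hessian of the mean training loss at the fitted parameters. The exact
    # inverse is only feasible for tiny models; large models need iterative
    # approximations (conjugate gradient, LiSSA).
    H = torch.autograd.functional.hessian(
        lambda t: loss_fn(t, X_train, y_train), theta.detach())
    H_inv = torch.linalg.inv(H + 1e-4 * torch.eye(3))  # damping for stability

    g_test = torch.autograd.grad(loss_fn(theta, x_test, y_test), theta)[0]

    # Score every training sample: large-magnitude scores mark the samples
    # whose removal would most change the loss on this test point.
    scores = []
    for i in range(len(X_train)):
        g_i = torch.autograd.grad(
            loss_fn(theta, X_train[i], y_train[i]), theta)[0]
        scores.append(-(g_test @ H_inv @ g_i).item())

    ranked = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    print("training samples ranked by |influence|:", ranked[:5])

Ranking samples this way approximates leave-one-out retraining without actually retraining the model once per sample, which is what makes the technique tractable as a training-data attribution tool.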
