Abstract
This paper sketches a prehistory of synthetic data in the development of simulation technologies. Synthetic data is connected to simulation by the technical problem of the reality gap: the gap between the synthetic data a model is trained on and the real-world data it is deployed on. The reality gap is presented as a novelty both generated and solved by synthetic data. We demonstrate that the reality gap has plagued simulation technologies since their inception in the mid-20th century. We contend that the reality gap is not something synthetic data can solve. To illustrate this, we examine three episodes in the prehistory of synthetic data. These episodes are representative of three distinct regimes of simulation: (a) the statistical regime, (b) the discrete-event regime, and (c) the visual-interactive regime. Each regime reveals a reality gap, from before the advent of digital computers to the present. Synthetic data, like simulations, requires data about a given domain in order to model it. It requires the real-world data that it purports to dispense with. The reality gap is thus an epistemological issue as well as a technical one. We argue that it is also a political-economic issue: it complicates existing means of producing data, adding new layers of mediation and labor. Synthetic data thus indicates the emergence of an alternative stack for the production of AI systems. This suggests that the political economy of AI must take account of the proliferation of new technical means for creating data.