Evaluation of the relative performance of different dose-finding designs is typically carried out using extensive simulations averaged over broad ranges of potential scenarios. Unfortunately, such evaluation lacks rigor, is unreliable and, in general, is not reproducible. Under identical experimental conditions, one study will show that method B improves upon method A while another will show the converse. This does not allow the user to proceed with the conviction of having made the more suitable choice. The problem stems from our failure to account for the fact that, in practice, we are dealing with heterogeneous populations. Leaning on the thinking and tools of causal inference, we show here how to address the issue of heterogeneity and how to obtain rigorous, reliable and reproducible evaluation. Causal evaluation allows us to eliminate biases and unnecessary sources of variability. The focus shifts to the individual studies themselves rather than to average behavior over many studies; that is, we make comparisons before we average rather than, as is usual practice, the other way around. Theoretical justifications and numerical illustrations are provided.
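As a purely illustrative sketch of the compare-then-average idea (not the evaluation procedure developed in the paper), the Python snippet below pairs two hypothetical dose-finding rules within each simulated study, giving both rules the same scenario and the same simulated patients (latent tolerances), and only then averages the study-level differences. The scenario generator, the toy selection rule, and the two "designs" A and B are invented for illustration; only the paired, within-study comparison structure is the point.

```python
import numpy as np

rng = np.random.default_rng(2024)
TARGET = 0.25  # hypothetical target toxicity rate defining the MTD

def draw_scenario(rng, n_doses=5):
    # one hypothetical scenario: a monotone dose-toxicity curve
    return np.sort(rng.uniform(0.05, 0.60, size=n_doses))

def select_mtd_toy(rule, tox_probs, tolerances):
    """Toy stand-in for a dose-finding design (not CRM, not 3+3).
    Walks through the patients, de-escalating after a toxicity and
    escalating otherwise; rule 'B' escalates less eagerly than 'A'."""
    dose, top = 0, len(tox_probs) - 1
    for u in tolerances:
        if u < tox_probs[dose]:            # toxicity: latent tolerance exceeded
            dose = max(dose - 1, 0)
        elif rule == "A" or u > 0.5:       # rule B escalates only part of the time
            dose = min(dose + 1, top)
    return dose

n_studies, n_patients = 2000, 30
paired_diff = []
for _ in range(n_studies):
    scenario = draw_scenario(rng)
    true_mtd = int(np.argmin(np.abs(scenario - TARGET)))
    # identical simulated patients (latent tolerances) go to both designs,
    # so the comparison is made within one matched study
    patients = rng.uniform(size=n_patients)
    hit_A = select_mtd_toy("A", scenario, patients) == true_mtd
    hit_B = select_mtd_toy("B", scenario, patients) == true_mtd
    paired_diff.append(int(hit_B) - int(hit_A))

# compare first (within each study), then average the paired differences
print(f"Mean paired advantage of B over A: {np.mean(paired_diff):+.3f}")
```

In this sketch, the estimand is the average of study-level differences rather than the difference of two separately averaged performance figures, which is one way the between-study (heterogeneity) variability is removed from the comparison.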