Response surface methods, and global optimization techniques in general, are typically evaluated using a small number of standard synthetic test problems, in the hope that these are a good surrogate for real-world problems. We introduce a new, more rigorous methodology for evaluating global optimization techniques that is based on generating thousands of test functions and then evaluating algorithm performance on each one. The test functions are generated by sampling from a Gaussian process, which allows us to create a set of test functions that are interesting and diverse. They will have different numbers of modes, different maxima, etc., and yet they will be similar to each other in overall structure and level of difficulty. This approach allows for a much richer empirical evaluation of methods that is capable of revealing insights that would not be gained using a small set of test functions. To facilitate the development of large empirical studies for evaluating response surface methods, we introduce a dimension-independent measure of average test problem difficulty, and we introduce acquisition criteria that are invariant to vertical shifting and scaling of the objective function. We also use our experimental methodology to conduct a large empirical study of response surface methods. We investigate the influence of three properties--parameter estimation, exploration level, and gradient information--on the performance of response surface methods.