Abstract
Over the past decade, predictive models at the intersection of molecular ecology, genomics, and global change have developed rapidly. The common goal of these ‘genomic forecasting’ models is to integrate genomic data with environmental and ecological data to make quantitative predictions about the vulnerability of populations to climate change. Despite rapid methodological development and a growing number of study systems in which genomic forecasts are made, the forecasts themselves are rarely evaluated rigorously with ground-truth experiments. This study reviews the evaluation experiments that have been conducted, introduces key terminology for the evaluation of genomic forecasting models, and discusses important elements in the design and reporting of ground-truth experiments. To date, experimental evaluations of genomic forecasts have found that forecast accuracy varies widely, but differences in approach and experimental design make it difficult to compare studies on common ground. Additionally, some evaluations may be biased toward higher performance because training and testing data are not independent. Beyond the independence of training and testing data, important elements in the design of an evaluation experiment include the construction and parameterization of the forecasting model, the choice of fitness proxies to measure for test data, the construction of the evaluation model, the choice of evaluation metric(s), the degree of extrapolation to novel environments or genotypes, and the sensitivity, uncertainty, and reproducibility of forecasts. Although genomic forecasting methods are becoming more accessible, evaluating their limitations in a particular study system requires careful planning and experimentation. Meticulously designed evaluation experiments can clarify the robustness of forecasts for application in management. Clear reporting of the basic elements of experimental design will improve the rigour of evaluations and, in turn, our understanding of why models work in some cases and not others.