Abstract

Earthquake probability forecasts are typically based on simulations of seismicity generated by statistical (point process) models, or on direct calculation when feasible. To systematically assess various aspects of such forecasts, the Collaboratory for the Study of Earthquake Predictability (CSEP) testing center has used the N- (number), M- (magnitude), S- (space), conditional likelihood (CL-), and T- (Student’s t) tests to evaluate earthquake forecasts on a gridded space–time range. This article demonstrates the correct use of the point process likelihood to evaluate forecast performance, covering marginal and conditional scores for quantities such as event numbers, occurrence times, locations, magnitudes, and correlations among space–time–magnitude cells. The results suggest that for models whose simulations rely only on the internal history of the process and not on external observations, such as the epidemic-type aftershock sequence (ETAS) model, testing and scoring can be rigorously implemented via the likelihood function. In particular, gridding the space domain unnecessarily complicates testing, and evaluating spatial forecasts directly via the marginal likelihood might be more promising.
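To make the notion of scoring "via the likelihood function" concrete, the following is a minimal sketch of a point process log-likelihood, assuming a purely temporal epidemic-type aftershock sequence model with a normalized Omori-type kernel. The function name, parameter names, and the restriction to the time dimension are illustrative assumptions and are not taken from the article, which also treats locations and magnitudes.

```python
import math

def etas_log_likelihood(times, mags, T, mu, K, alpha, c, p, m0):
    """Log-likelihood of a temporal ETAS-type model on [0, T].

    Hypothetical helper for illustration. Assumes `times` is sorted in
    increasing order, all times lie in [0, T], and p > 1.
    Conditional intensity (assumed form):
        lambda(t) = mu + sum_{t_i < t} K * exp(alpha * (m_i - m0)) * g(t - t_i)
    with g(u) = (p - 1) * c**(p - 1) * (u + c)**(-p), which integrates to 1.
    """
    # Sum of log-intensities evaluated at each observed event time.
    log_sum = 0.0
    for j, tj in enumerate(times):
        lam = mu
        for ti, mi in zip(times[:j], mags[:j]):
            productivity = K * math.exp(alpha * (mi - m0))
            lam += productivity * (p - 1) * c ** (p - 1) * (tj - ti + c) ** (-p)
        log_sum += math.log(lam)

    # Compensator: integral of the conditional intensity over [0, T].
    compensator = mu * T
    for ti, mi in zip(times, mags):
        productivity = K * math.exp(alpha * (mi - m0))
        compensator += productivity * (1.0 - (c / (T - ti + c)) ** (p - 1))

    return log_sum - compensator
```

Under these assumptions, the difference in log-likelihood between two models evaluated on the same catalog gives a simple relative score. Setting `K = 0` reduces the function to the log-likelihood of a homogeneous Poisson process, which is a convenient sanity check.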