Estimating treatment effects is one of the most challenging and important tasks for data analysts. Personalized medicine, digital marketing, and many other applications demand an efficient allocation of scarce treatments to the individuals who benefit the most. Uplift models support this allocation by estimating how individuals respond to a treatment. A major challenge in uplift modeling concerns evaluation. Previous literature suggests metrics such as the Qini curve and the transformed outcome mean squared error. However, these metrics suffer from high variance: their evaluations are strongly affected by random noise in the data, which renders their signals partly arbitrary. We theoretically analyze the variance of uplift evaluation metrics and derive variance reduction methods based on statistical adjustment of the outcome. We establish simple conditions under which these methods improve the uplift evaluation metrics and empirically demonstrate their benefits on simulated and real-world data. Our paper provides strong evidence in favor of applying the suggested variance reduction procedures by default when evaluating uplift models on randomized controlled trial (RCT) data.
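For concreteness, a minimal sketch of the quantities involved, with notation assumed here rather than taken from the paper: let $W \in \{0,1\}$ denote the treatment indicator and $p = P(W = 1)$ the known RCT treatment probability. The transformed outcome and its mean squared error for an uplift estimate $\hat{\tau}$ are
$$
Y^{*} \;=\; Y\,\frac{W - p}{p(1 - p)}, \qquad
\widehat{\mathrm{MSE}}(\hat{\tau}) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i^{*} - \hat{\tau}(X_i)\bigr)^{2},
$$
where $\mathbb{E}[Y^{*} \mid X] = \tau(X)$. A variance reduction by statistical adjustment of the outcome, in the spirit described above, would replace $Y$ by $Y - \hat{\mu}(X)$ for an outcome model $\hat{\mu}$ fitted on independent data; because $\mathbb{E}[W - p \mid X] = 0$ in an RCT, the adjusted transformed outcome remains unbiased for $\tau(X)$ while its variance can shrink.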