Background Obtaining sufficient statistical power has been a key issue in the genetic study of depression; only recently have samples become large enough to identify any genome-wide significant hits for this phenotype. Researchers have used a range of approaches to balance sample size against phenotypic depth, and here we evaluate the statistical power achieved with these different methods.

Methods We collated details of published genome-wide association studies of major depression or depressive symptoms. For each study we assessed sampling methods, phenotyping criteria and sample size, and used the Genetic Power Calculator (GPC; Purcell et al. 2003) and extension methods (Yang et al. 2010, Traylor et al. 2015) to calculate statistical power and compare approaches within the literature. We compared the statistical power (as captured by the non-centrality parameter) of each cohort across a range of allele thresholds and genetic effect sizes, contrasting different methodologies of cohort ascertainment within the depression literature. We evaluated how statistical power is affected by the selection of potentially more homogeneous but less prevalent depression phenotypes, the potential impact of misclassification in large cohorts that have employed brief phenotypic screening questions, and the value of dimensional measures of depressive symptoms in population-based cohorts.

Results We confirm that the findings of Traylor et al.
apply in the case of depression: in analyses of specific subtypes of depression performed in datasets including PGC2 (2013; secondary analyses of recurrent, early-onset recurrent and typical-like depression, observed in approximately 75%, 50% and 40% of cases respectively) and CONVERGE (2015; secondary analysis of melancholia, observed in approximately 85% of cases), only very modest increases in genotypic relative risk (GRR) are required to achieve power equivalent to that of the whole-sample analyses, particularly for lower risk allele frequencies and lower index GRR. We also compare the statistical power of the PGC2 case-control sample (discovery sample n=18,759) and the CHARGE Consortium study of depressive symptoms (discovery sample n=34,549; Hek et al. 2013), observing that, with a sample size 54% of that in CHARGE, the PGC2 dataset has less than half the statistical power to detect genetic effects.

Discussion Given the range of approaches in use, we provide an evaluation of expected statistical power specific to the field of major depression genetics. We compare the statistical power achieved in the current literature and evaluate how different phenotypic approaches affect it. If the ultimate aim of our research is to improve outcomes for patients with depression, our phenotypic definitions must be informed by clinical relevance. Nevertheless, an understanding of how these definitions affect statistical power is key to optimal study design for identifying the genetic variants associated with depressive disorders.
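
The trade-off between a smaller subtype sample and a larger required effect size can be illustrated with a minimal sketch. It is not the GPC implementation or the method of Traylor et al.; it assumes a multiplicative risk model, Hardy-Weinberg equilibrium, screened controls, and a 1-df allelic test, and uses a common textbook approximation for the non-centrality parameter (NCP). All function names and input values (sample sizes, prevalence, allele frequency, GRR) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def allelic_ncp(n_cases, n_controls, raf, grr, prevalence):
    """Approximate NCP of a 1-df allelic test (multiplicative model, HWE)."""
    q = 1.0 - raf
    # Baseline penetrance f0 from K = f0 * (q + raf * grr)^2.
    f0 = prevalence / (q + raf * grr) ** 2
    f1, f2 = f0 * grr, f0 * grr ** 2
    # Risk allele frequency in cases and in screened (unaffected) controls.
    p_case = (raf ** 2 * f2 + raf * q * f1) / prevalence
    p_ctrl = (raf ** 2 * (1 - f2) + raf * q * (1 - f1)) / (1 - prevalence)
    # Pooled allele frequency, weighted by group size.
    w = n_cases / (n_cases + n_controls)
    p_bar = w * p_case + (1 - w) * p_ctrl
    var = p_bar * (1 - p_bar) * (1 / (2 * n_cases) + 1 / (2 * n_controls))
    return (p_case - p_ctrl) ** 2 / var


def power_from_ncp(ncp, alpha=5e-8):
    """Power of a two-sided 1-df test: P(|Z + sqrt(ncp)| > critical value)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = sqrt(ncp)
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


def grr_for_equal_ncp(target_ncp, n_cases, n_controls, raf, prevalence,
                      lo=1.0, hi=3.0, tol=1e-6):
    """Bisection search for the GRR at which the NCP reaches target_ncp."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if allelic_ncp(n_cases, n_controls, raf, mid, prevalence) < target_ncp:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical whole sample: 10,000 cases, 10,000 controls, prevalence 15%.
whole_ncp = allelic_ncp(10_000, 10_000, raf=0.2, grr=1.05, prevalence=0.15)

# Hypothetical subtype seen in 75% of cases; as a simplification, the subtype
# prevalence is taken to be 75% of the overall prevalence.
subtype_grr = grr_for_equal_ncp(whole_ncp, 7_500, 10_000,
                                raf=0.2, prevalence=0.15 * 0.75)

print(f"whole-sample NCP: {whole_ncp:.2f}")
print(f"whole-sample power at alpha=5e-8: {power_from_ncp(whole_ncp):.4f}")
print(f"subtype GRR needed for equal NCP: {subtype_grr:.4f}")
```

Under these assumptions, the GRR returned for the subtype can be compared directly with the index GRR of the whole sample to see how much larger the effect must be in the more homogeneous subgroup for the two analyses to have equal power.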