Background Major Depressive Disorder (MDD) is one of the most common psychiatric disorders, with a prevalence of ~15%. Diagnosis is often inaccurate and relies on self-report, and MDD is largely under-diagnosed. Data-driven approaches to predicting lifetime risk for MDD and single versus recurrent MDD, especially those using a small number of highly accurate predictors that could be easily collected in clinic, would be a step towards realising the promise of personalised medicine. We applied machine learning algorithms (MLAs) to Generation Scotland, a deeply phenotyped cohort study (N > 21,000). The cohort was first divided into a training set (63%) and a held-out test set (37%) for both the lifetime MDD and the single versus recurrent MDD analyses.
Methods In the training set, 10-fold cross-validation was used to estimate optimal hyperparameters for the following algorithms: Random Forest, Conditional Inference Forest, Gradient Boosting Machine (GBM), Support Vector Machines (with linear, polynomial and radial basis function kernels), Neural Networks, C5.0, Elastic Net and Forward Stepwise Regression. Using the optimal hyperparameters, we fitted each MLA on the full training set and then applied the resulting model to the independent test data to predict case-control or single-recurrent status. Performance on the test data was assessed using the Area Under the Receiver Operating Characteristic Curve (AUC). To obtain a reduced set of predictors with optimised predictive value, we used the Markov Chain 4 (MC4) algorithm, a rank aggregation algorithm originally designed for Internet meta-search rankings. The predictor rankings from all MLAs were aggregated with MC4, and the aggregated ranking was then used to determine the smallest set of predictors with equal or better predictive value on the test data than the full set of predictors.
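The rank aggregation step can be illustrated with a minimal sketch of MC4 as it is usually described: from the current predictor P, a candidate Q is drawn uniformly at random, and the chain moves to Q only if a majority of the lists that rank both place Q above P. Predictors are then ordered by their stationary probability. The function name `mc4_aggregate`, the `teleport` value (a small uniform jump that keeps the chain ergodic), and the toy rankings are assumptions for illustration, not the settings or data used in this study.

```python
def mc4_aggregate(rankings, teleport=0.05, max_iter=10_000, tol=1e-12):
    """Aggregate several best-first rankings with an MC4-style Markov chain.

    `rankings` is a list of lists, each ordered from most to least important.
    Returns all items ordered by stationary probability (highest first).
    The teleport value is an illustrative assumption.
    """
    items = sorted({item for ranking in rankings for item in ranking})
    n = len(items)
    if n == 0:
        return []
    positions = [{item: rank for rank, item in enumerate(r)} for r in rankings]

    def majority_prefers(q, p):
        # True if q is ranked above p in a strict majority of lists containing both.
        votes = [pos[q] < pos[p] for pos in positions if p in pos and q in pos]
        return bool(votes) and 2 * sum(votes) > len(votes)

    # Row-stochastic transition matrix: move from p to q with probability 1/n
    # when the majority prefers q; otherwise stay at p.
    transition = []
    for i, p in enumerate(items):
        row = [1.0 / n if q != p and majority_prefers(q, p) else 0.0 for q in items]
        row[i] = 1.0 - sum(row)
        transition.append(row)

    # Power iteration on the teleport-smoothed chain.
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [
            (1 - teleport) * sum(pi[i] * transition[i][j] for i in range(n))
            + teleport / n
            for j in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(nxt, pi)) < tol
        pi = nxt
        if converged:
            break

    order = sorted(range(n), key=lambda i: (-pi[i], items[i]))
    return [items[i] for i in order]


# Toy importance rankings from three hypothetical models (not study data).
toy_rankings = [
    ["neuroticism", "distress", "age", "income"],
    ["distress", "neuroticism", "age", "income"],
    ["neuroticism", "age", "distress", "income"],
]
print(mc4_aggregate(toy_rankings))
# ['neuroticism', 'distress', 'age', 'income']
```

In this toy example every pairwise majority agrees with the final order, so the aggregate ranking is unambiguous; with conflicting majorities the stationary distribution resolves the cycle.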
Results AUC values for the prediction of lifetime MDD ranged from 0.81 to 0.84 across all MLAs; for single versus recurrent MDD, they ranged from 0.69 to 0.76. All AUCs were significantly better than expected by chance (AUC = 0.5). Using the MC4-ranked variables and the best-performing model (GBM), we obtained equal performance with only 20 variables for lifetime MDD (AUC = 0.84), compared with 155 variables in the full set, and with only 10 variables for single versus recurrent MDD (AUC = 0.76), compared with 180 variables in the full set.
Discussion The MC4-ranked subsets performed as well as the full predictor sets, although other subsets likely exist that would perform similarly. The subset identified for lifetime MDD included neuroticism, general psychological distress, age, income, family history of depression, living alone, sex, home ownership, smoking, education, pain status, mania and schizotypy. Age at MDD onset, neuroticism, psychological distress, measures of cognition, age, smoking and home ownership were the most relevant predictors of whether someone would have a single episode of MDD or recurrent episodes. These highly predictive questionnaire and demographic measures can be easily collected in clinic to assist in accurate diagnosis and in preventative treatment of recurrent MDD.
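The search for the smallest predictor set that matches the full set can be sketched as follows, under stated assumptions. Here `roc_auc` computes AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, and `smallest_sufficient_subset` walks down an aggregated ranking until the top-k predictors reach the full-set AUC. The callable `auc_for` is hypothetical: it stands in for refitting a model on the training set with the given predictors and returning its AUC on the held-out test set. The toy numbers are illustrative and are not results from this study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of cases (1) and controls (0); ties count half."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def smallest_sufficient_subset(ranked_predictors, auc_for, tolerance=0.0):
    """Return the shortest top-k prefix whose AUC is within `tolerance` of the full set.

    `auc_for` is a hypothetical callable mapping a list of predictors to a
    held-out AUC; `tolerance` is an illustrative parameter.
    """
    full_auc = auc_for(ranked_predictors)
    for k in range(1, len(ranked_predictors) + 1):
        subset = ranked_predictors[:k]
        subset_auc = auc_for(subset)
        if subset_auc >= full_auc - tolerance:
            return subset, subset_auc
    return list(ranked_predictors), full_auc


# Toy check of the AUC helper: three of the four case-control pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75

# Toy stand-in for model fitting: AUC rises with each predictor, then plateaus.
ranked = ["neuroticism", "distress", "age", "income", "smoking"]
toy_auc = lambda predictors: min(0.84, 0.70 + 0.05 * len(predictors))
subset, auc = smallest_sufficient_subset(ranked, toy_auc)
print(subset, auc)  # ['neuroticism', 'distress', 'age'] 0.84
```

Evaluating only prefixes of the aggregated ranking keeps the search linear in the number of predictors, at the cost of possibly missing other subsets that perform similarly.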