The availability of large-scale datasets and sophisticated machine learning tools enables the development of models that predict treatment outcomes for individual patients. However, few studies have used routinely available sociodemographic and clinical data for this task, and many previous investigations have relied on highly selected samples. This study aimed to investigate cognitive behavioral therapy (CBT) outcomes in a large, naturalistic, longitudinal dataset. Routine data from a university-based outpatient center with n = 2,147 patients were analyzed. Only baseline data, including sociodemographics, symptom measures, and functional impairment ratings, were used for prediction. Competing classification and regression models were compared with each other; the best models were then applied to previously unseen validation data. Applied to the validation set, the best-performing classification model for remission achieved a balanced accuracy of 59% (p < 0.001), and the best-performing regression model for dimensional change achieved r = 0.27 (p < 0.001). Age, sex, functional impairment, symptom severity, and axis II comorbidity were among the most important features. Prediction performance significantly exceeded chance level but fell far short of clinical utility. Neither applying more sophisticated approaches nor restricting the sample to homogeneous subgroups resulted in considerable performance gains. Adding hypothesis-based, more specific clinical constructs and phenotypes ranging from deep (e.g. neurobiological) to digital may increase prediction performance.
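For readers less familiar with the two reported metrics, the following is a minimal sketch of how balanced accuracy and Pearson's r can be computed. It is not the study's analysis code; the function names and the toy labels are hypothetical and serve illustration only. It shows why balanced accuracy has a chance level of 50% regardless of how many patients remit: a model that always predicts the majority class can reach high plain accuracy while its balanced accuracy stays at 0.5.

```python
from math import sqrt


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = remitted)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def pearson_r(x, y):
    """Pearson correlation between observed and predicted values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy labels: 2 remitted and 6 non-remitted patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
# A trivial model that always predicts "not remitted".
y_pred = [0] * 8

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                      # 0.75
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

Against this baseline, a balanced accuracy of 59% lies nine percentage points above chance, and r = 0.27 corresponds to roughly 7% of shared variance (0.27² ≈ 0.073) between predicted and observed change.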