Importance: Models predicting health care spending and other outcomes from administrative records are widely used to manage and pay for health care, despite well-documented deficiencies. New methods are needed that can incorporate more than 70 000 diagnoses without creating undesirable coding incentives.

Objective: To develop a machine learning (ML) algorithm, building on Diagnostic Item (DXI) categories and Diagnostic Cost Group (DCG) methods, that automates the development of clinically credible and transparent predictive models for policymakers and clinicians.

Design, Setting, and Participants: DXIs were organized into disease hierarchies and assigned an Appropriateness to Include (ATI) score to reflect vagueness and gameability concerns. A novel automated DCG algorithm iteratively assigned DXIs in 1 or more disease hierarchies to DCGs, identifying the set of DXIs with the largest regression coefficient as dominant; the presence of a previously identified dominating DXI removed lower-ranked DXIs before the next iteration. The Merative MarketScan Commercial Claims and Encounters Database for commercial health insurance enrollees 64 years and younger was used. Data from January 2016 through December 2018 were randomly split 90% to 10% for model development and validation, respectively. Deidentified claims and enrollment data for each calendar year were delivered by Merative the following November and analyzed from November 2020 to January 2024.

Main Outcomes and Measures: Concurrent top-coded total health care cost. Model performance was assessed using weighted least-squares regression in the validation sample, mean absolute errors, and mean errors for rare and common diagnoses.

Results: This study included 35 245 586 commercial health insurance enrollees 64 years and younger (65 901 460 person-years); 19 clinicians provided reviews for the base model. The algorithm implemented 218 clinician-specified hierarchies, compared with the 64 hierarchies of the US Department of Health and Human Services (HHS) hierarchical condition category (HCC) model.
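The iterative dominance-and-removal step described above can be sketched as follows. This is a minimal illustration, not the published implementation: the function names, the toy ordinary-least-squares fit (standing in for the paper's weighted regression), and the person-by-DXI indicator layout are all assumptions for exposition.

```python
# Illustrative sketch of an iterative hierarchy-dominance pass: in each
# round, the DXI with the largest fitted coefficient is marked dominant,
# and lower-ranked DXIs in the same hierarchy are removed for any person
# who carries the dominant DXI. All names and data shapes are assumptions.
import numpy as np

def fit_coefficients(X, y):
    # Toy stand-in for the paper's regression step: plain least squares.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def iterate_dcgs(X, y, hierarchies, n_rounds=3):
    """X: person x DXI 0/1 matrix; y: per-person cost;
    hierarchies: lists of DXI column indices that share a disease hierarchy."""
    X = X.astype(float).copy()
    dominant_order = []
    for _ in range(n_rounds):
        coef = fit_coefficients(X, y)
        # Ignore DXIs already identified as dominant in earlier rounds.
        masked = coef.copy()
        masked[dominant_order] = -np.inf
        dom = int(np.argmax(masked))
        dominant_order.append(dom)
        # Within each hierarchy containing the dominant DXI, remove the
        # other members for people who have the dominant DXI.
        for h in hierarchies:
            if dom in h:
                has_dom = X[:, dom] == 1
                for d in h:
                    if d != dom:
                        X[has_dom, d] = 0.0
    return dominant_order, X
```

On a toy cohort where DXI 0 is the costliest member of a two-DXI hierarchy, the sketch marks DXI 0 dominant first and zeroes out the co-occurring lower-ranked DXI before refitting.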
The base model, which dropped vague and gameable DXIs, reduced the number of parameters by 52% (1624 of 3150), achieved an R2 of 0.535, and kept mean predicted spending within 12% ($3843 of $31 313) of actual spending for the 3% of people with rare diseases. In contrast, the HHS HCC model had an R2 of 0.428 and underpaid this group by 33% ($10 354 of $31 313).

Conclusions and Relevance: By automating DXI clustering within clinically specified hierarchies, this algorithm built clinically interpretable risk models in large datasets while addressing diagnostic vagueness and gameability concerns.
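The validation metrics named above (weighted R2, mean absolute error, and mean error for a subgroup such as people with rare diseases) can be sketched as follows. The helper names are assumptions for illustration, not the study's code; the final lines simply reproduce the abstract's percentage gaps from its own dollar figures.

```python
# Illustrative metric helpers; function names and toy data are assumptions.
import numpy as np

def r_squared(y_true, y_pred, weights=None):
    # Weighted R2: 1 - (weighted residual SS) / (weighted total SS).
    w = np.ones_like(y_true) if weights is None else weights
    ybar = np.average(y_true, weights=w)
    ss_res = np.sum(w * (y_true - y_pred) ** 2)
    ss_tot = np.sum(w * (y_true - ybar) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def subgroup_mean_error(y_true, y_pred, mask):
    # Positive values indicate underprediction (underpayment) on average.
    return float(np.mean(y_true[mask] - y_pred[mask]))

# Reproduce the reported gaps from the abstract's dollar figures.
actual = 31313
print(round(100 * 3843 / actual))   # 12 (base model gap, rare diseases)
print(round(100 * 10354 / actual))  # 33 (HHS HCC underpayment)
```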